
Advanced Computer Architecture

Introduction into the Sequential and Pipeline Instruction Execution

Chapter 1

Martin Milata

What is a Processor Architecture

● Instruction Set Architecture (ISA)

● Defines the interface (or boundary) between software and hardware

● Describes the class of ISA, memory addressing and access, operands and operations, control flow instructions, and instruction encoding

● Design and Organization

● Describes the high-level aspects of the processor's design.

● Defines the logical blocks or units that the processor contains (ALU, FPU, general-purpose registers, MMU, controller, …)

● Hardware

● Includes the detailed logic design of the processor, the low-level aspect (including the expected process technology)


Evolution of the Microprocessor Architecture - ISA

● CISC and RISC architecture

● Today, most x86 processors internally translate CISC instructions into RISC-like instructions

● Two critical performance techniques

● Instruction-Level Parallelism (ILP): first pipelining, later multiple instruction issue

● Cache memory usage: from simple caches to hierarchical cache memory with sophisticated organization and optimization

● Growth of processor performance today

● ILP, cache memory, and specialized processor units brought ~50% performance growth per year for the 16 years up to 2002

● Since then, uniprocessor performance has grown by only ~20% per year

● Future growth is focused on thread-level parallelism (TLP) and data-level parallelism (DLP)


Evolution of the Microprocessor Architecture - Design

● Design and Organization

● From simple processors with a single ALU to today's complex multi-core processors

● Offloading work from the CPU to peripheral devices (for example graphics or network cards)

● Integration of peripheral devices on the chip (northbridge, graphics)

● Hardware evolution

● The first fully electronic digital computing device was the Atanasoff–Berry Computer, which was built from vacuum tubes.

● Modern 32 nm and 22 nm process technologies are used in production today.


A Reference RISC processor (1)

● A RISC architecture will be used to illustrate the basic concepts. The ideas introduced are, of course, applicable to other processor architectures.

● A “typical” RISC processor

● 32-bit fixed-format instructions

● 32-bit general-purpose registers

● Memory access only via Load and Store instructions with a single addressing mode (base + displacement)

● Simple branch conditions

● Delayed branch


A Reference RISC processor (2)

● Instruction types

● Register-Register

● Register-Immediate

● Branch

● Jump/Call

● The 5 stages of a RISC instruction

● Instruction fetch cycle (IF)

● Instruction decode/register fetch cycle (ID)

● Execution/effective address cycle (EX)

● Memory access (MEM)

● Write-back cycle (WB)

● Each stage uses its own processor resources; nothing is shared between stages (except for memory, which both IF and MEM access)


5 Stages of MIPS Datapath


Sequential Unpipelined Instruction Execution

● It was used in the first computers.

● This processor architecture model is called the subscalar architecture.

● Instructions are executed one by one, in sequential order, without any overlap.

● The instruction stages do not need to have the same processing time.

● Some parts of the processor may be idle in some clock cycles.

● The time needed to execute the whole program is the sum of the execution times of all its instructions.

Tc = T × N × CPI

● Tc: total execution time, T: clock period, N: total number of instructions, CPI: clock cycles per instruction

● It is still used in very simple computers (such as simple calculators)

Q: How many cycles does the execution of these two instructions require?
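
The formula Tc = T × N × CPI can be tried out with a few lines of Python. This is only a sketch: the function name and the numbers (10 ns clock, 1000 instructions, 5 cycles per instruction) are illustrative assumptions, not values from the lecture.

```python
def total_execution_time(clock_period, instruction_count, cpi):
    """Tc = T * N * CPI for sequential (unpipelined) execution."""
    return clock_period * instruction_count * cpi


# Hypothetical workload: 10 ns clock, 1000 instructions, 5 cycles each.
tc_ns = total_execution_time(clock_period=10, instruction_count=1000, cpi=5)
print(f"Total execution time: {tc_ns} ns")
```

Halving any one of the three factors halves Tc, which is why later slides attack CPI (pipelining) rather than only the clock period.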


Pipelined Instruction Execution

● Overlapping the execution of instructions makes better use of the processor units.

● Scalar processor architecture model

● It does not reduce the execution time of an individual instruction

● It increases the CPU instruction throughput

● Pipelined execution requires some modifications of the hardware architecture

● Bypassing and/or short-circuiting

● Pipeline registers

● Pipeline overhead

● Setup time of the pipeline registers

Q: Will it work correctly?
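
The throughput gain can be illustrated by counting cycles. The sketch below assumes an ideal case that the slide does not state explicitly: every stage takes exactly one clock cycle and no hazards occur. The function names are hypothetical.

```python
def sequential_cycles(n_instructions, n_stages=5):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=5):
    """The first instruction fills the pipeline; each later one finishes one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def ideal_speedup(n_instructions, n_stages=5):
    """Ratio of sequential to pipelined cycle counts (assumes no stalls)."""
    return sequential_cycles(n_instructions, n_stages) / pipelined_cycles(n_instructions, n_stages)


for n in (1, 2, 100):
    print(n, sequential_cycles(n), pipelined_cycles(n))
```

For a single instruction the pipeline brings no benefit, and for long instruction streams the ideal speedup approaches the number of stages. Hazards, discussed on the following slides, reduce it.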


MIPS Pipeline Datapath


Pipeline Timing, Throughput and Latency

● Pipeline Timing

● Unpipelined execution

– The time of each stage can be different

● Pipelined execution

– Same execution time for each stage

– Balanced pipeline

● Pipeline Throughput

● Number of instructions completed per second

● Pipeline Latency

● How long it takes to execute a single instruction in the pipeline.
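
A small sketch of how timing, latency, and throughput relate. It assumes that the pipeline clock is set by the slowest stage plus the pipeline register overhead; the stage times and the 1 ns overhead are made-up illustrative values.

```python
def unpipelined_latency(stage_times):
    """Without pipelining, an instruction takes the sum of its stage times."""
    return sum(stage_times)


def pipeline_clock(stage_times, register_overhead=0):
    """All stages share one clock, so the slowest stage (plus overhead) sets the period."""
    return max(stage_times) + register_overhead


def pipelined_latency(stage_times, register_overhead=0):
    """One instruction still has to pass through every stage, one clock period each."""
    return len(stage_times) * pipeline_clock(stage_times, register_overhead)


def throughput(clock_period):
    """Instructions completed per time unit once the pipeline is full (no stalls assumed)."""
    return 1 / clock_period


stages_ns = [10, 8, 10, 10, 7]  # hypothetical IF, ID, EX, MEM, WB times in ns
print(unpipelined_latency(stages_ns))
print(pipelined_latency(stages_ns, register_overhead=1))
print(throughput(pipeline_clock(stages_ns, register_overhead=1)))
```

With these numbers the latency of a single instruction grows, while one instruction completes every clock period instead of every sum-of-stages period. This is why a balanced pipeline matters: an unbalanced stage slows down all the others.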


Pipeline Hazards

● Hazards prevent the next instruction from executing in its designated clock cycle.

● They reduce the speedup gained from pipelined execution

● There are three classes of hazards

● Structural hazards: arise from resource conflicts. The hardware cannot support all combinations of instructions in overlapped execution.

● Data hazards: an instruction depends on the result of a previous instruction, which is not yet available when it is needed.

● Control hazards: may arise when the program counter is changed by a flow control instruction (branch).


Structural Hazards

● Example 1: An instruction fetch and a data memory access collide on a processor with a single memory access channel.

● The “OR” instruction cannot be fetched in cycle 4 because of a structural hazard (no memory port is free for fetching a new instruction)


Structural Hazards

● Possible solutions of the structural hazard from Example 1

● Separate instruction and data memories (raises the cost of the CPU)

● Stall the pipeline for one cycle (decreases the speedup of pipelined execution)
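
The conflict from Example 1 can be located mechanically. The sketch below assumes the ideal schedule of the 5-stage pipeline, in which the instruction at index i (counting from 0) is fetched in cycle i + 1 and reaches MEM in cycle i + 4, and it assumes a single memory port. The instruction sequence is illustrative; only the fourth instruction being an OR is taken from the slide.

```python
MEMORY_OPS = {"lw", "sw"}  # instructions that access data memory in their MEM stage


def fetch_conflicts(program):
    """Return (cycle, memory_instr_index, blocked_fetch_index) for each IF/MEM collision.

    Assumes the ideal (stall-free) schedule and a single memory port.
    """
    conflicts = []
    for i, op in enumerate(program):
        if op in MEMORY_OPS:
            mem_cycle = i + 4            # IF in cycle i+1, ID i+2, EX i+3, MEM i+4
            blocked = mem_cycle - 1      # index of the instruction fetched in that cycle
            if blocked < len(program):
                conflicts.append((mem_cycle, i, blocked))
    return conflicts


program = ["lw", "add", "sub", "or"]     # hypothetical sequence starting with a load
print(fetch_conflicts(program))
```

Each reported tuple is a cycle in which the hardware must either stall the fetch or provide a second memory port (separate instruction and data memories).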


Data Hazard

● RAW data hazard on register R1

● The hazard affects the instructions sub, add and or: the result of the first add instruction is not available in time.


Three Generic Classes of Data Hazards

● Read After Write (RAW)

● The following instruction tries to use a result of a previous instruction that is not yet available at that moment

    add r1,r2,r3
    sub r6,r1,r4

● The impact of this type of hazard can be reduced with a hardware technique called forwarding.

– The ALU result from the EX/MEM and MEM/WB pipeline registers is fed back to the ALU inputs.

– The hardware must be able to detect RAW hazards.

● Called a true dependence


Three Generic Classes of Data Hazards

● Write After Read (WAR)

● The following instruction writes its result into a register before the previous instruction can read the original value.

    mul r2,r1,r3
    add r1,r4,r5

● The hardware must not allow this data hazard to occur

– It cannot happen in this simple 5-stage MIPS pipeline

– It is common in processors with complex pipelining

● Called an anti-dependence


Three Generic Classes of Data Hazards

● Write After Write (WAW)

● The following instruction writes its result into a register before the previous instruction does.

    sub r1,r2,r3
    add r1,r4,r3

● The hardware must not allow this data hazard to occur

– It cannot happen in this simple 5-stage MIPS pipeline

● Called an output dependence

Q: Why can the last two types of data hazards not happen in this type of processor?
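
The three classes can be told apart by comparing the destination and source registers of two instructions. The sketch below uses a hypothetical `Instr` tuple and `classify` function and applies them to the three instruction pairs from the slides. It reports the dependence between the pair; whether the dependence becomes an actual hazard depends on the pipeline.

```python
from collections import namedtuple

# Hypothetical instruction model: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")


def classify(first, second):
    """Return the set of dependence types of `second` on the earlier `first`."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("RAW")   # second reads what first writes
    if second.dest in first.srcs:
        kinds.add("WAR")   # second overwrites what first still has to read
    if first.dest == second.dest:
        kinds.add("WAW")   # both write the same register
    return kinds


raw = classify(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r6", ("r1", "r4")))
war = classify(Instr("mul", "r2", ("r1", "r3")), Instr("add", "r1", ("r4", "r5")))
waw = classify(Instr("sub", "r1", ("r2", "r3")), Instr("add", "r1", ("r4", "r3")))
print(raw, war, waw)
```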


HW Changes for Forwarding

● Forwarding paths

● Modified multiplexers that select where the ALU operands come from

● Logic in the ID stage that determines the sources of the ALU operands
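
The operand selection can be sketched as a comparison of register numbers. This is a simplified model and not the exact logic of the datapath in the figure: the function and parameter names are assumptions, and the newest result (in EX/MEM) is given priority over the older one (in MEM/WB). Register 0 is never forwarded because it is hardwired to zero in MIPS.

```python
def forward_select(src_reg, ex_mem_dest, ex_mem_writes, mem_wb_dest, mem_wb_writes):
    """Choose the source of one ALU operand.

    src_reg       -- register number the instruction entering EX wants to read
    ex_mem_dest   -- destination register held in the EX/MEM pipeline register
    ex_mem_writes -- True if that older instruction writes a register
    mem_wb_dest, mem_wb_writes -- the same for the MEM/WB pipeline register
    """
    if ex_mem_writes and ex_mem_dest != 0 and ex_mem_dest == src_reg:
        return "EX/MEM"      # result of the immediately preceding instruction
    if mem_wb_writes and mem_wb_dest != 0 and mem_wb_dest == src_reg:
        return "MEM/WB"      # result of the instruction two positions earlier
    return "REGFILE"         # no hazard: use the value read in ID


# add r1,r2,r3 followed directly by sub r6,r1,r4: r1 comes from EX/MEM.
print(forward_select(1, ex_mem_dest=1, ex_mem_writes=True, mem_wb_dest=0, mem_wb_writes=False))
```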


Unsolved Data Hazards

● Not all data hazards can be resolved by forwarding

● The data from a load (lw) instruction is available only after the fourth clock cycle. Even forwarding cannot deliver it to the ALU by the end of the third clock cycle, when the following instruction needs it.


How To Avoid Load Hazards

● Hardware-assisted solution

● The hardware must be able to recognize load hazards

● The pipeline is stalled when a load hazard occurs

● Software scheduling when compiling an application

● During compilation, independent instructions can be used to reorder the code into an acceptable sequence without load hazards
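
Both approaches can be illustrated with a small sketch. It assumes forwarding is present, so that only an instruction directly after a load that uses the loaded register costs one stall cycle. The `Instr` tuple, the function name, and the two example sequences are hypothetical.

```python
from collections import namedtuple

# Hypothetical instruction model: opcode, destination register, source registers.
Instr = namedtuple("Instr", "op dest srcs")


def load_use_stalls(program):
    """Count one stall for every instruction that uses a register loaded by its direct predecessor."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


# Original order: each load is followed immediately by its consumer.
original = [
    Instr("lw", "r1", ("r2",)),
    Instr("add", "r3", ("r1", "r4")),
    Instr("lw", "r5", ("r2",)),
    Instr("sub", "r6", ("r5", "r4")),
]

# Compiler-scheduled order: the second load is moved up to separate loads from their consumers.
scheduled = [original[0], original[2], original[1], original[3]]

print(load_use_stalls(original), load_use_stalls(scheduled))
```

The reordering is legal here because the second load does not depend on the add. A real compiler has to prove such independence before moving instructions.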


Control Hazards (Branch Hazards)

● They can cause a greater performance loss than data hazards

● Branch Hazards

● When a branch is executed, the PC may or may not change to something other than PC + 4


Branch Stall Impact

● The branch penalty for the 5-stage MIPS pipeline is 3 clock cycles

● The branch target is known at the end of the execute/address-calculation stage (3rd clock cycle)

● Minimization of the branch penalty with a HW modification

● In the instruction decode stage

– Determines if the branch is taken or not

– Calculates its target address

● The branch penalty is minimized but still exists

● Branch prediction schemes

● Simple compile-time schemes

● Dynamic branch prediction
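
The effect of the branch penalty on performance can be estimated with the usual stall-cycle accounting: effective CPI = base CPI + branch frequency × branch penalty. The sketch below assumes a base CPI of 1 and that every branch pays the full penalty; the 20% branch frequency is an illustrative assumption, not a figure from the lecture.

```python
def cpi_with_branches(branch_fraction, penalty_cycles, base_cpi=1.0):
    """Effective CPI when every branch costs `penalty_cycles` stall cycles."""
    return base_cpi + branch_fraction * penalty_cycles


# Hypothetical program in which 20% of the instructions are branches.
late = cpi_with_branches(0.20, penalty_cycles=3)   # target known after EX, penalty 3
early = cpi_with_branches(0.20, penalty_cycles=1)  # branch resolved in ID, penalty 1
print(round(late, 2), round(early, 2))
```

Under these assumptions, resolving the branch in the ID stage removes most of the loss but not all of it, which motivates the prediction schemes that follow.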


Reducing Pipeline Branch Penalties with a “Simple” HW Modification

● Early branch determination

● Target of the branch is known in the “ID” stage of a branch instruction


Reducing Pipeline Branch Penalties with the Static Scheduling Strategy

● Stall the pipeline until the branch destination is known

● The simplest scheme to handle branches

● The pipeline is frozen or flushed, which wastes CPU time

● In the 5-stage MIPS pipeline this wastes CPU time because the next instruction cannot be fetched until the branch destination is known

● Predict branch not taken

● The next instruction is fetched as if the branch were not taken

– PC = PC + 4

● The instruction must be removed from the pipeline if the branch is taken

● The processor state must not change until the branch outcome is known

● Approximately 47% of MIPS branches are not taken

● Predict branch taken

● The inverse of the previous case

Reducing Pipeline Branch Penalties with the Static Scheduling Strategy

● Delayed branch

● The branch takes effect after the instruction(s) that follow it

● The delay slot is the number of clock cycles needed to obtain the branch target address

● The instructions following the branch in the delay slot are executed whether the branch is taken or not

    branch instruction
    sequential successor 1
    sequential successor 2
    …
    sequential successor n
    branch target if taken

(Sequential successors 1 to n form the branch delay slot of length n.)

Q: What is the length of the branch delay slot in the 5-stage MIPS pipeline with early branch determination?


Delayed Branch

● Where do we get instructions to fill the branch delay slot?

● Canceling instructions in the branch delay slot

● The second two cases may require canceling the instructions in the delay slot

● The instruction is executed, but its write-back must be disabled until the branch outcome is known.


Dynamic Branch Prediction

● The simplest branch prediction scheme is a branch-prediction buffer, or branch history table

● A small cache with a limited number of entries, indexed by the lower bits of the branch address; each entry contains a bit that says whether the branch was recently taken or not

● The prediction may not be correct.

– Another branch with the same lower address bits can modify the prediction bit

– A branch condition cannot be predicted with certainty

● Local scope of the prediction bit

● 2-bit prediction schemes are often used


Two-Bit Dynamic Branch Prediction

● The scope of the predictors is local (per-branch prediction)

● The misprediction rate of a 2-bit branch predictor with 4K entries ranges from 1% to 18%
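
A branch history table with 2-bit saturating counters can be sketched as follows. The class and function names are hypothetical. The sketch also assumes that the table is indexed by the word address (the branch address shifted right by two bits, since instructions are 32 bits wide) and that all counters start in the "weakly not taken" state.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating counters (0-1: not taken, 2-3: taken)."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.counters = [1] * entries   # assumed initial state: weakly not taken

    def _index(self, pc):
        # Lower bits of the word address select the entry; no tag check, so branches can alias.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Predict, compare with the real outcome, then train the predictor."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch that is taken 9 times and then falls through, executed 3 times.
loop_outcomes = ([True] * 9 + [False]) * 3
print(count_mispredictions(TwoBitPredictor(), 0x1000, loop_outcomes))
```

After the warm-up miss, the 2-bit counter mispredicts only the loop exit. A single prediction bit would additionally mispredict the first iteration of each later run of the loop.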


Advanced Branch Prediction Schemes

● Correlating Branch Predictors (Two-level adaptive predictors)

● Look first at the recent behavior of other branches

● The behavior of the predicted branch itself has lower priority

● Tournament Predictors: Adaptively Combining Local and Global Predictors

● Use multiple predictors

– A local predictor, a global predictor, and a selector that chooses between them

● Effective at medium sizes of the prediction cache (8K – 32K bits)
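
The idea of a correlating predictor can be sketched by letting a global history register choose among several 2-bit counters per table entry. This is a simplified model under assumptions of my own (table size, two history bits, counters starting at "weakly not taken"); it is not a description of any particular processor.

```python
class CorrelatingPredictor:
    """Each table entry holds one 2-bit counter per global-history pattern."""

    def __init__(self, entries=1024, history_bits=2):
        self.entries = entries
        self.history_bits = history_bits
        self.history = 0  # outcomes of the most recent branches, newest in bit 0
        self.counters = [[1] * (1 << history_bits) for _ in range(entries)]

    def _slot(self, pc):
        return self.counters[(pc >> 2) % self.entries]

    def predict(self, pc):
        return self._slot(pc)[self.history] >= 2

    def update(self, pc, taken):
        slot = self._slot(pc)
        value = slot[self.history]
        slot[self.history] = min(3, value + 1) if taken else max(0, value - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A branch that strictly alternates between taken and not taken.
alternating = [True, False] * 10
print(count_mispredictions(CorrelatingPredictor(), 0x2000, alternating))
```

Once the counters for each history pattern are trained, the alternating branch is predicted correctly every time, whereas a single 2-bit counter mispredicts such a pattern at least every other time.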


Extending the MIPS Pipeline to Handle Multi-cycle Operations

● 4 separate functional units

● Integer unit (1 clock)

– Integer ALU, load, store, branches

● FP and integer multiplier (7 clocks)

● FP adder (4 clocks)

– FP add, subtract, conversion

● FP and integer divider

– Unpipelined (24 clocks)

● Brings the possibility of WAW and RAW data hazards and of structural hazards.


Structural Hazards and Multi-cycle Operation Extension

● The probability of structural hazards increases with the presence of unpipelined units

● Only one instruction at a time can be in the EX stage of such a unit (issuing problem)

● The number of register writes required in one cycle can be larger than 1, because instructions have different running times

● Possible solutions are to implement

– multiple write ports, or separate integer and floating-point registers

– a simple shift register that indicates when already-issued instructions will use the register file, and stalls the issue of a new instruction that would conflict

– a mechanism that stalls a conflicting instruction when it tries to enter the MEM or WB stage
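
The reservation idea behind the shift register can be sketched as follows. The model makes several simplifying assumptions: a set of reserved cycles stands in for the shift register, an instruction issued in cycle c with EX latency L writes the register file in cycle c + L + 2, every instruction writes a register, and the structural hazard of the unpipelined divider is ignored. The latencies are those from the multi-cycle extension slide; all names are hypothetical.

```python
# EX latencies in clock cycles, taken from the functional units listed earlier.
EX_LATENCY = {"int": 1, "fadd": 4, "fmul": 7, "fdiv": 24}


def schedule_issue(ops):
    """Return the issue cycle of each instruction so that at most one write-back happens per cycle."""
    reserved = set()        # cycles in which the single write port is already taken
    issue_cycles = []
    cycle = 1
    for op in ops:
        # Stall issue while the write-back cycle of this instruction is already reserved.
        while cycle + EX_LATENCY[op] + 2 in reserved:
            cycle += 1
        reserved.add(cycle + EX_LATENCY[op] + 2)
        issue_cycles.append(cycle)
        cycle += 1          # in-order issue, one instruction per cycle
    return issue_cycles


# The FP add would reach write-back in the same cycle as the earlier multiply.
print(schedule_issue(["fmul", "int", "int", "fadd"]))
```

In this example only the FP add is delayed, by one cycle. Checking at issue time keeps the stall logic in one place, at the price of stalling on a conflict that lies several cycles in the future.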


Data Hazards and Multi-cycle Operation Extension

● Stalls for RAW hazards will be more frequent (instructions have different running times)

● WAW hazards are possible

● Instructions do not reach the WB stage in the order in which they were issued

● The processor does not know the length of the pipeline when it issues an instruction

● A simple solution: if an instruction wants to write into the same register as an instruction already issued, it is stalled.
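
The simple solution can be sketched as an issue rule. The model is deliberately conservative and rests on assumptions of my own: an instruction issued in cycle c with EX latency L writes its register in cycle c + L + 2, and a later instruction with the same destination waits until that write has happened. The names and the example program are hypothetical.

```python
# EX latencies in clock cycles, taken from the functional units listed earlier.
EX_LATENCY = {"int": 1, "fadd": 4, "fmul": 7, "fdiv": 24}


def issue_with_waw_stalls(program):
    """program: list of (unit, destination register). Returns the issue cycle of each instruction."""
    pending_write = {}      # destination register -> cycle of its most recently scheduled write
    issue_cycles = []
    cycle = 1
    for unit, dest in program:
        # Stall while an earlier instruction has not yet written the same register.
        if dest in pending_write and pending_write[dest] >= cycle:
            cycle = pending_write[dest] + 1
        issue_cycles.append(cycle)
        pending_write[dest] = cycle + EX_LATENCY[unit] + 2
        cycle += 1
    return issue_cycles


# Without the stall, the short FP add would write f2 before the long multiply does.
print(issue_with_waw_stalls([("fmul", "f2"), ("fadd", "f2")]))
```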


Super-scalar processors

● Issuing one instruction every clock cycle is not enough to use all functional units of the processor effectively

● Multiple instructions are issued, decoded and forwarded for execution in each cycle

● The processor must have multiple functional units for each pipeline stage

● Multiple read/write ports into the register set have to be present

● The processor must have multiple read ports into the instruction cache (or memory)


Super-scalar processors

● Many independent instructions have to be present in the code of an application

● Instructions issued together must not depend on each other, to avoid data hazards

● Independent instructions can be scheduled statically by the compiler, or dynamically by the hardware before their execution


Superscalar processors

● Types of superscalar processors

● Superscalar with dynamic issue structure

– Some instructions can be executed out of order. No speculation during execution

– Not used today

● Superscalar with dynamic issue and speculative scheduling

– Out-of-order execution with speculation

– Pentium 4, Core, Core 2, …, IBM Power5

● VLIW/LIW

– Statically issued instructions with static scheduling. Hazard detection and other preparation at compilation time

– Embedded space

● EPIC (primarily statically issued instructions with mostly static scheduling)

– Hazard detection at compilation time

– Intel Itanium


Advantage of Dynamic Scheduling

● Allows code that was compiled for one pipeline to run efficiently on a different pipeline

● Compiled code is not platform-dependent

● It simplifies the compiler

● Optimization for efficient execution is not necessary

● It can handle dependencies that were not known at compile time.


Dynamic Scheduling

● The basic idea of dynamic scheduling is to allow out-of-order execution of instructions after their dynamic issue.

● Out-of-order execution implies out-of-order completion, and it introduces the possibility of WAR and WAW hazards. Both of these hazards are avoided by register renaming (discussed in a later chapter).

● To allow out-of-order execution, it is essential to split the ID stage of the simple five-stage pipeline into two stages

● Issue (II) – decodes the instruction and checks for structural hazards, all in order

● Read operands (RO) – waits until there are no data hazards, then reads the operands; waiting instructions may leave this stage out of order


Dynamic Scheduling

● The issue window is the number of instructions that can be forwarded to out-of-order execution before they are completed

● Out-of-order completion has to be managed with a scoreboard table or, more commonly today, with Tomasulo's algorithm


Static scheduling at the compilation time

● VLIW/LIW or EPIC processors have their instructions scheduled at compilation time.

● Advantages of compilation-time scheduling

● The compiler has the code of the whole application available for finding independent instructions (instructions without dependencies)

● The long instruction word, with mostly statically scheduled instructions, can be created very effectively

● Effective usage of all processor units (in the best case)

● Disadvantages

● Not all potential hazard states can be resolved at compilation time

● The compiler detects potential hazards, and hardware help is necessary to resolve them

– The processor has only a limited ability to resolve hazards


Conclusion

● Introduction to the description of computer architecture with the help of the ISA

● Unpipelined instruction execution on processors with five independent execution stages

● Pipelined instruction execution on the reference processor with the simple 5-stage MIPS pipeline

● Classes of hazards and their possible solutions

● Extension of the 5-stage MIPS pipeline for multi-cycle instruction execution

● Introduction to super-scalar processors and to dynamic and static scheduling of independent instructions
