
Advanced Computer Architecture

Introduction to Sequential and Pipelined Instruction Execution

Chapter 1

Martin Milata

What Is a Processor Architecture

● Instruction Set Architecture (ISA)

● Defines the interface (or boundary) between software and hardware

● Describes the class of ISA, memory addressing and access, operands and operations, control-flow instructions, and ISA encoding

● Design and Organization

● Describes the high-level aspects of the processor's design.

● Defines the logical blocks or units that the processor contains (ALU, FPU, general-purpose registers, MMU, controller, …)

● Hardware

● Includes the detailed logic design of the processor: the low-level aspect (including the expected process technology)


Evolution of the Microprocessor Architecture - ISA

● CISC and RISC architecture

● Today, most x86 processors internally translate CISC instructions into RISC-like instructions

● Two critical performance techniques

● Instruction-Level Parallelism (ILP): first pipelining, later multiple instruction issue

● Cache memory usage: from simple caches to hierarchical cache memory with sophisticated organization and optimization

● Growth of processor performance today

● ILP, cache memory and specialized processor units brought ~50% performance growth per year for 16 years, until 2002

● Since then, uniprocessor performance has grown by only ~20% per year

● Future growth relies on thread-level parallelism (TLP) and data-level parallelism (DLP) techniques


Evolution of the Microprocessor Architecture - Design

● Design and Organization

● From simple processors with a single ALU to today's complex multi-core processors

● Offloading work from the CPU to peripheral devices (for example, graphics or network cards)

● Integration of peripheral devices on the chip (northbridge, graphics)

● Hardware evolution

● From the first fully electronic digital computing device, the Atanasoff–Berry Computer, which was built from vacuum tubes

● To the modern 32 nm and 22 nm process technologies used in production today


A Reference RISC processor (1)

● A RISC architecture will be used to illustrate the basic concepts. The ideas introduced here also apply to other processor architectures.

● A “typical” RISC processor

● 32-bit fixed-format instructions

● 32-bit general-purpose registers

● Memory access only via Load and Store instructions with a single addressing mode (base + displacement)

● Simple branch conditions

● Delayed branch

A Reference RISC processor (2)

● Instruction types

● Register-Register

● Register-Immediate

● Branch

● Jump/Call

● 5 stages of RISC instruction execution

● Instruction fetch cycle (IF)

● Instruction decode/register fetch cycle (ID)

● Execution/effective address cycle (EX)

● Memory access (MEM)

● Write-back cycle (WB)

● The stages do not share processor resources (except for MEM and IF, which both access memory)


5 Stages of MIPS Datapath

Sequential Unpipelined Instruction Execution

● It was used in the first computers.

● This processor architecture model is called subscalar.

● Instructions are executed one by one, in sequential order, without any overlap.

● The stages of an instruction do not need equal processing times

● Some parts of the processor may not be used in some clock cycles.

● The time needed to execute a whole program is the sum of the execution times of all its instructions.

Tc = T × N × CPI

● Tc - total execution time, T - clock period, N - total number of instructions, CPI - clock cycles per instruction

● It is still used in very simple computers (like simple calculators)

Q: How many cycles does the execution of these two instructions require?
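
The execution-time formula for sequential execution can be tried out with a minimal sketch. The workload (1,000 instructions, 5 cycles each, 10 ns clock) and the function name are assumptions chosen for illustration, not values from the lecture.

```python
def total_execution_time(clock_period_s, instruction_count, cpi):
    """Tc = T * N * CPI for sequential (unpipelined) execution."""
    return clock_period_s * instruction_count * cpi

# Assumed workload: N = 1000 instructions, CPI = 5 (one cycle per stage), T = 10 ns.
t_c = total_execution_time(10e-9, 1000, 5)
print(f"Tc = {t_c * 1e6:.1f} microseconds")
```

With these assumed numbers the program needs 50 microseconds; halving either T, N or CPI halves Tc.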

Instruction Pipeline Execution

● Overlapping the execution of instructions makes better use of the processor units.

● Scalar processor architecture model

● It does not reduce the execution time of an individual instruction

● It increases the CPU instruction throughput

● Pipelined execution requires some modifications of the hardware architecture

● Bypassing and/or short-circuiting

● Pipeline registers

● Pipeline overhead

● Setup time on pipeline registers

Q: Will it work correctly?


MIPS Pipeline Datapath

Pipeline Timing, Throughput and Latency

● Pipeline Timing

● Unpipelined execution

– Time of each stage can be different

● Pipelined execution

– Same execution time for each stage

– Balanced pipeline

● Pipeline Throughput

● Number of instructions completed per second

● Pipeline Latency

● How long it takes to execute a single instruction in the pipeline.
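
The relation between timing, latency and throughput can be sketched numerically. The stage times, the 1 ns pipeline-register overhead and all names below are assumptions for illustration; the sketch also assumes an ideal pipeline without hazards.

```python
def pipeline_metrics(stage_times_ns, register_overhead_ns, n_instructions):
    """Compare unpipelined and pipelined execution of n_instructions."""
    depth = len(stage_times_ns)
    # Unpipelined: stages may differ in length; one instruction takes their sum.
    unpipelined_latency = sum(stage_times_ns)
    # Pipelined: every stage runs at the pace of the slowest one plus register overhead.
    clock = max(stage_times_ns) + register_overhead_ns
    pipelined_latency = depth * clock
    unpipelined_total = n_instructions * unpipelined_latency
    # The first instruction needs `depth` cycles; each later one completes a cycle after.
    pipelined_total = (depth + n_instructions - 1) * clock
    return {
        "clock_ns": clock,
        "unpipelined_latency_ns": unpipelined_latency,
        "pipelined_latency_ns": pipelined_latency,
        "speedup": unpipelined_total / pipelined_total,
        "throughput_per_ns": n_instructions / pipelined_total,
    }

# Assumed stage times for IF, ID, EX, MEM, WB in nanoseconds.
metrics = pipeline_metrics([10, 8, 10, 10, 7], register_overhead_ns=1, n_instructions=1000)
for name, value in metrics.items():
    print(name, round(value, 4))
```

In this sketch the latency of a single instruction grows from 45 ns to 55 ns, while the throughput improves by roughly a factor of four.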


Pipeline Hazards

● Hazards prevent the execution of the next instruction in its designated clock cycle.

● They reduce the speedup gained from pipelined execution

● There are three classes of hazards

● Structural hazards: arise from resource conflicts. The hardware may not support all combinations of instructions in overlapped execution.

● Data hazards: an instruction depends on the result of a previous instruction that is not yet available when it is needed.

● Control hazards: arise when a control-flow instruction (branch) changes the program counter.

Structural Hazards

● Example 1: instruction fetch and data access both need memory on a processor with a single memory access channel.

● The “OR” instruction cannot be fetched in cycle 4 because of a structural hazard (no memory port is free for fetching a new instruction)

Structural Hazards

● Possible solutions to the structural hazard from Example 1

● Separate instruction and data memories (increases the cost of the CPU)

● Stall the pipeline for one cycle (decreases the pipeline speedup)

Data Hazard

● RAW data hazard on register R1

● The hazard affects the instructions sub, add and or: the result of the first instruction (add) is not available in time.

Three Generic Classes of Data Hazards

● Read After Write (RAW)

● The following instruction tries to use a result of the previous instruction before it is available

add r1,r2,r3
sub r6,r1,r4

● The impact of this type of hazard can be reduced with a hardware technique called forwarding.

– The ALU result from the EX/MEM and MEM/WB pipeline registers is fed back to the ALU inputs.

– The hardware must be able to detect RAW hazards.

● Called a true dependence

● Write After Read (WAR)

● The following instruction writes its result into a register before the previous instruction has read the original value.

mul r2,r1,r3
add r1,r4,r5

● The hardware must not allow this data hazard to occur

– It cannot happen in this simple 5-stage MIPS pipeline

– It is common in processors with complex pipelining

● Called an anti-dependence

● Write After Write (WAW)

● The following instruction writes its result into a register before the previous instruction writes its own.

sub r1,r2,r3
add r1,r4,r3

● The hardware must not allow this data hazard to occur

– It cannot happen in this simple 5-stage MIPS pipeline

● Called an output dependence

Q: Why can the last two types of data hazards not happen in this type of processor?
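
The three classes can be told apart by comparing the destination and source registers of two instructions. The following is a minimal sketch; the `Instr` tuple and the function name are hypothetical helpers, and the sketch only names the dependence, it does not decide whether a given pipeline actually suffers a hazard from it.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple

def classify_dependences(first, second):
    """Return the dependence classes between an earlier and a later instruction."""
    kinds = []
    if first.dest in second.srcs:
        kinds.append("RAW")   # later instruction reads what the earlier one writes
    if second.dest in first.srcs:
        kinds.append("WAR")   # later instruction overwrites a source of the earlier one
    if first.dest == second.dest:
        kinds.append("WAW")   # both write the same register
    return kinds

# The three instruction pairs used as examples on the slides.
raw = classify_dependences(Instr("add", "r1", ("r2", "r3")), Instr("sub", "r6", ("r1", "r4")))
war = classify_dependences(Instr("mul", "r2", ("r1", "r3")), Instr("add", "r1", ("r4", "r5")))
waw = classify_dependences(Instr("sub", "r1", ("r2", "r3")), Instr("add", "r1", ("r4", "r3")))
print(raw, war, waw)
```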

HW Changes for Forwarding

● Forwarding paths

● Modified multiplexers that select where the ALU operands come from

● Logic in the ID stage for determining the sources of the ALU operands
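
The selection logic for one ALU input can be sketched as follows. This is an illustrative model, not the lecture's circuit: the dictionary fields `reg_write` and `rd` and the returned labels are assumptions, and register 0 is treated as the hardwired zero register of MIPS.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Pick the source of one ALU operand: 'EX/MEM', 'MEM/WB' or 'REGFILE'."""
    # The most recent result wins, so the EX/MEM pipeline register is checked first.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"   # no hazard: use the value read in the ID stage

# add r1,r2,r3 is one stage ahead of sub r6,r1,r4, so its result sits in EX/MEM.
ex_mem = {"reg_write": True, "rd": 1}
mem_wb = {"reg_write": False, "rd": 0}
print(forward_select(1, ex_mem, mem_wb), forward_select(4, ex_mem, mem_wb))
```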

Unsolved Data Hazards

● Not all data hazards can be resolved by forwarding

● The data from a load (lw) instruction is available only after its fourth clock cycle. Even forwarding cannot deliver it to the ALU at the end of the third clock cycle, when the following instruction needs it.

How To Avoid Load Hazards

● Hardware-assisted solution

● The hardware must be able to recognize load hazards

● The pipeline is stalled when a load hazard occurs

● Software scheduling by the compiler

● During compilation, the instructions are reordered, with the help of independent instructions, into a sequence without load hazards
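
Both approaches can be illustrated with a small sketch that counts load-use stalls before and after reordering. The instruction tuples, register names and the code sequence (a = b + c; d = e - f) are assumptions for illustration; the sketch assumes one stall cycle whenever an instruction uses the result of the load directly in front of it.

```python
def count_load_use_stalls(program):
    """Count stalls: a load immediately followed by a user of its result."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

# (op, destination, sources); stores have no destination register.
original = [
    ("lw", "rb", ("r0",)), ("lw", "rc", ("r0",)),
    ("add", "ra", ("rb", "rc")), ("sw", None, ("ra", "r0")),
    ("lw", "re", ("r0",)), ("lw", "rf", ("r0",)),
    ("sub", "rd", ("re", "rf")), ("sw", None, ("rd", "r0")),
]
# A compiler could move the third load up so that no load feeds its direct successor.
rescheduled = [original[i] for i in (0, 1, 4, 2, 5, 3, 6, 7)]
print(count_load_use_stalls(original), count_load_use_stalls(rescheduled))
```

The original order stalls twice; the reordered sequence executes the same instructions without a load-use stall.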

Control Hazards (Branch Hazards)

● They can cause a greater performance loss than data hazards

● Branch Hazards

● When a branch is executed, it may or may not change the PC to something other than PC + 4

Branch Stall Impact

● The branch penalty of the 5-stage MIPS pipeline is 3 clock cycles

● The branch target is known at the end of the execute/address-calculation stage (3rd clock cycle)

● Minimizing the branch penalty with a HW modification

● In the instruction decode stage, the hardware

– Determines whether the branch is taken or not

– Calculates its target address

● The branch penalty is minimized but still exists

● Branch prediction schemes

● Simple compile-time schemes

● Dynamic branch prediction
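
The effect of the branch penalty on performance can be estimated with a short sketch. The 20% branch frequency and the ideal base CPI of 1 are assumptions for illustration; the 3-cycle penalty is the one stated above, and a 1-cycle penalty is assumed for branches resolved in the ID stage.

```python
def effective_cpi(branch_fraction, branch_penalty_cycles, base_cpi=1.0):
    """CPI of a pipeline in which every branch stalls for the full penalty."""
    return base_cpi + branch_fraction * branch_penalty_cycles

BRANCH_FRACTION = 0.20  # assumed share of branches in the instruction mix

cpi_late = effective_cpi(BRANCH_FRACTION, 3)    # branch resolved late
cpi_early = effective_cpi(BRANCH_FRACTION, 1)   # branch resolved in ID (assumed 1 cycle)
print(f"late: {cpi_late:.2f}, early: {cpi_early:.2f}")
```

Under these assumptions the early branch determination lowers the CPI from 1.6 to 1.2, which is why the penalty is worth reducing even though it does not disappear.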


Reducing Pipeline Branch Penalties with a “Simple” HW Modification

● Early branch determination

● Target of the branch is known in the “ID” stage of a branch instruction

Reducing Pipeline Branch Penalties with the Static Scheduling Strategy

● Stall the pipeline until the branch destination is known

● The simplest scheme to handle branches

● The pipeline is frozen or flushed, which wastes CPU time

● In the 5-stage MIPS pipeline, the next instruction cannot be fetched until the branch destination is known

● Predicted branch not taken

● The next instruction is fetched as if the branch were not taken

– PC = PC + 4

● The fetched instruction must be removed from the pipeline if the branch is taken

● The processor state cannot change until the branch outcome is known

● Approximately 47% of MIPS branches are not taken

● Predicted branch taken

● Inverse alternative to the previous case

Reducing Pipeline Branch Penalties with the Static Scheduling Strategy

● Delayed branch

● The branch takes effect after the instructions that follow it

● The delay slot is the number of clock cycles needed to obtain the branch target address

● The instructions in the delay slot, which follow the branch, are executed whether the branch is taken or not

branch instruction
sequential successor 1
sequential successor 2
…
sequential successor n
branch target if taken

(sequential successors 1 to n form the branch delay slot of length n)

Q: What is the length of the branch delay slot in the 5-stage MIPS pipeline with early branch determination?


Delayed Branch

● Where do we get instructions to fill branch delay slot?

● Canceling instructions in the branch delay slot

● The latter two cases may require canceling the instructions in the delay slot

● The instruction is executed, but its write-back must be disabled until the branch outcome is known.

Dynamic Branch Prediction

● The simplest dynamic branch prediction scheme is a branch-prediction buffer, or branch history table

● A small cache with a limited number of entries, indexed by the lower bits of the branch address; each entry contains a bit that says whether the branch was recently taken or not

● The prediction may not be correct.

– Another branch with the same lower address bits can modify the prediction bit

– The outcome of a branch condition cannot always be predicted

● Local scope of the prediction bit

● 2-bit prediction schemes are often used
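
A one-bit branch history table, including the aliasing problem, can be sketched in a few lines. The class name, the 4 index bits and the two branch addresses are assumptions for illustration; the sketch assumes 4-byte instructions, so the two lowest address bits are dropped before indexing.

```python
class OneBitBHT:
    """Branch history table with one prediction bit per entry."""

    def __init__(self, index_bits):
        self.mask = (1 << index_bits) - 1
        self.table = [False] * (1 << index_bits)   # False = predict not taken

    def _index(self, pc):
        return (pc >> 2) & self.mask   # lower bits of the word-aligned branch address

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, taken):
        self.table[self._index(pc)] = taken

bht = OneBitBHT(index_bits=4)
bht.update(0x100, True)              # branch at 0x100 was taken
before = bht.predict(0x100)          # predicted taken
bht.update(0x140, False)             # a different branch that maps to the same entry
after = bht.predict(0x100)           # the prediction for 0x100 has been overwritten
print(before, after)
```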

Two-Bit Dynamic Branch Prediction

● The scope of the predictors is local (per-branch prediction)

● The misprediction rate of a 2-bit branch predictor with 4K entries ranges from 1% to 18%
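
Why two bits help can be seen on a loop branch. The sketch below models a single saturating counter of 1 or 2 bits; the branch pattern (taken nine times, then not taken, repeated ten times) and the initial state are assumptions for illustration.

```python
def mispredictions(outcomes, bits):
    """Count mispredictions of one saturating counter with the given width."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)      # states at or above the threshold predict taken
    state = 0                        # assumed initial state: strongly not taken
    misses = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            misses += 1
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return misses

# A loop branch: taken 9 times, falls through once, and the loop runs 10 times.
loop_branch = ([True] * 9 + [False]) * 10
one_bit = mispredictions(loop_branch, bits=1)
two_bit = mispredictions(loop_branch, bits=2)
print(one_bit, two_bit)
```

The 1-bit scheme mispredicts twice per loop execution (at the exit and at the next entry), while the 2-bit counter mispredicts only at the exit once it has warmed up.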

Advanced Branch Prediction Schemes

● Correlating Branch Predictors (Two-level adaptive predictors)

● Look first at the recent behavior of other branches

● The behavior of the predicted branch itself has lower priority

● Tournament Predictors: Adaptively Combining Local and Global Predictors

● Use multiple predictors

– A local-based predictor, a global-based predictor, and a combination of both chosen by a selector

● Effective at medium sizes of the prediction cache (8K – 32K bits)
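
The benefit of correlation can be demonstrated with a simplified sketch. It is an illustrative model, not a specific published design: 2-bit counters are selected by the low address bits together with a global history of recent outcomes. The two branch addresses, the outcome pattern and the rule that branch B always behaves like branch A are assumptions.

```python
class CorrelatingPredictor:
    """2-bit counters indexed by branch address bits and global branch history."""

    def __init__(self, index_bits, history_bits):
        self.index_bits = index_bits
        self.history_bits = history_bits
        self.history = 0
        self.counters = [1] * (1 << (index_bits + history_bits))  # weakly not taken

    def _slot(self, pc):
        low = (pc >> 2) & ((1 << self.index_bits) - 1)
        return (low << self.history_bits) | self.history

    def predict(self, pc):
        return self.counters[self._slot(pc)] >= 2

    def update(self, pc, taken):
        slot = self._slot(pc)
        value = self.counters[slot]
        self.counters[slot] = min(value + 1, 3) if taken else max(value - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

def misses_on_b(predictor, pattern, pc_a=0x40, pc_b=0x44):
    """Branch B always has the same outcome as the branch A executed just before it."""
    misses = 0
    for outcome in pattern:
        predictor.update(pc_a, outcome)
        if predictor.predict(pc_b) != outcome:
            misses += 1
        predictor.update(pc_b, outcome)
    return misses

pattern = [True, True, False, True, False, False, True, False] * 10
plain = misses_on_b(CorrelatingPredictor(index_bits=2, history_bits=0), pattern)
correlated = misses_on_b(CorrelatingPredictor(index_bits=2, history_bits=1), pattern)
print(plain, correlated)
```

With zero history bits the model degenerates to a plain 2-bit predictor, which struggles with the irregular pattern; one bit of global history lets branch B follow branch A almost perfectly.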

Extending the MIPS Pipeline to Handle Multi-cycle Operations

● 4 separate functional units

● Integer unit (1 clock)

– Integer ALU, load, store, branches

● FP and integer multiplier (7 clocks)

● FP adder (4 clocks)

– FP add, subtract, conversion

● FP and integer divider (24 clocks)

– Unpipelined

● Brings the possibility of WAW and RAW data hazards and of structural hazards.


Structural Hazards and Multi-cycle Operation Extension

● The probability of the structural hazards increases with the presence of unpipelined units

● Only one instruction can be in EX stage on these units (issuing problem)

● The number of register writes required in one cycle can be larger than 1, because instructions have varying running times

● Possible solutions are to implement

– multiple write ports, or separate integer and floating-point register files

– a simple shift register that indicates when already-issued instructions will use the register file, and that stalls the issue of a new instruction on a conflict

– a mechanism that stalls a conflicting instruction when it tries to enter the MEM or WB stage
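
The shift-register solution can be sketched as a list of reservation bits that moves by one position per clock. The class, the horizon of 8 cycles and the write-back distances (6 cycles for an FP add, 3 for an integer instruction) are assumptions for illustration.

```python
class WritePortReservation:
    """bits[i] is True when the register write port is reserved i cycles from now."""

    def __init__(self, horizon):
        self.bits = [False] * horizon

    def try_issue(self, cycles_until_writeback):
        if self.bits[cycles_until_writeback]:
            return False                       # conflict: the instruction must stall
        self.bits[cycles_until_writeback] = True
        return True

    def tick(self):
        self.bits = self.bits[1:] + [False]    # shift by one bit every clock cycle

port = WritePortReservation(horizon=8)
fp_add_issued = port.try_issue(6)      # FP add will write back in 6 cycles
for _ in range(3):
    port.tick()
int_first_try = port.try_issue(3)      # would write in the same cycle as the FP add
port.tick()
int_second_try = port.try_issue(3)     # one cycle later the write port is free
print(fp_add_issued, int_first_try, int_second_try)
```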

Data Hazards and Multi-cycle Operation Extension

● Stalls for RAW hazards will be more frequent (varying running time of instructions)

● WAW hazards are possible

● Instructions do not reach the WB stage in the order in which they were issued

● When issuing an instruction, the processor knows nothing about the length of its pipeline

● A simple solution: if an instruction wants to write into the same register as an already-issued instruction, it is stalled.
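
The simple WAW solution can be sketched as an issue check against pending register writes. The latencies follow the functional units listed earlier (divide 24, add 4 clocks); the program, the register names and the simplification that an instruction waits until the earlier write has completed are assumptions.

```python
def issue_cycles(program, latency):
    """In-order issue, one instruction per cycle, stalling on a pending write to dest."""
    busy_until = {}   # destination register -> cycle in which its pending write completes
    cycle = 0
    issued = []
    for op, dest in program:
        # WAW check: wait while an already-issued instruction still has to write dest.
        cycle = max(cycle, busy_until.get(dest, 0))
        issued.append(cycle)
        busy_until[dest] = cycle + latency[op]
        cycle += 1
    return issued

LATENCY = {"div": 24, "add": 4}
program = [("div", "f0"), ("add", "f2"), ("add", "f0")]
cycles = issue_cycles(program, LATENCY)
print(cycles)
```

Without the stall, the second add would write f0 long before the divide and the divide would then overwrite the newer value.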

Super-scalar processors

● Issuing one instruction per clock cycle is not enough to use all functional units of the processor effectively

● Multiple instructions are issued, decoded and forwarded for execution

● The processor must have multiple functional units for each pipeline stage

● Multiple read/write ports into the register file have to be present

● The processor must have multiple read ports into the instruction cache (or memory)

Super-scalar processors

● Many independent instructions have to be present in the code of an application

● Instructions issued together must not depend on each other, to avoid data hazards

● Independent instructions can be scheduled statically by the compiler or dynamically by the hardware before their execution

Superscalar processors

● Types of superscalar processors

● Superscalar with dynamic issue structure

– Some instructions can be executed out of order. No speculation during execution

– Not used today

● Superscalar with dynamic issue and speculative scheduling

– Out-of-order execution with speculation

– Pentium 4, Core, Core 2, …, IBM Power5

● VLIW/LIW

– Statically issued instructions with static scheduling. Hazard detection and other preparation at compile time

– Embedded space

● EPIC (primarily statically issued instructions with mostly static scheduling)

– Hazard detection at compile time

– Intel Itanium

Advantage of Dynamic Scheduling

● Allows code that was compiled for one pipeline to run efficiently on a different pipeline

● Compiled code is not platform dependent

● It simplifies the compiler

● Optimization for efficient execution is not necessary

● It can handle dependencies that were not known at compile time.

Dynamic Scheduling

● The basic idea of dynamic scheduling is to allow out-of-order execution of instructions after their dynamic issue.

● Out-of-order execution implies out-of-order completion, and it introduces the possibility of WAR and WAW hazards. Both of these hazards are avoided by register renaming (discussed in a later chapter).

● To allow out-of-order execution, it is essential to split the ID stage of the simple five-stage pipeline into two stages

● Issue (II) – decode instructions and check for structural hazards, all in order

● Read operands (RO) – wait until there are no data hazards, then read the operands; waiting instructions may proceed out of order
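
The gain from reading operands out of order can be shown with a small timing sketch. It is a deliberately simplified model: one instruction is issued per cycle, only RAW dependences are tracked, and structural, WAR and WAW hazards are ignored. The program, the register names and the latencies (divide 24, add and subtract 4 clocks) are assumptions for illustration.

```python
def start_cycles(program, latency, out_of_order):
    """Cycle in which each instruction starts executing."""
    ready_at = {}        # register -> cycle in which its new value becomes available
    starts = []
    prev_start = -1
    for i, (op, dest, srcs) in enumerate(program):
        issue = i        # in-order issue, one instruction per cycle
        operands_ready = max((ready_at.get(s, 0) for s in srcs), default=0)
        start = max(issue, operands_ready)
        if not out_of_order:
            # In-order execution: an instruction may not overtake its predecessor.
            start = max(start, prev_start + 1)
        starts.append(start)
        ready_at[dest] = start + latency[op]
        prev_start = start
    return starts

LATENCY = {"div": 24, "add": 4, "sub": 4}
program = [
    ("div", "f0", ("f2", "f4")),
    ("add", "f10", ("f0", "f8")),    # must wait for the divide (RAW on f0)
    ("sub", "f12", ("f8", "f14")),   # independent of both earlier instructions
]
in_order = start_cycles(program, LATENCY, out_of_order=False)
dynamic = start_cycles(program, LATENCY, out_of_order=True)
print(in_order, dynamic)
```

In the in-order case the independent subtract is stuck behind the waiting add; with out-of-order execution it starts as soon as it has been issued.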

Dynamic Scheduling

● The issue window is the number of instructions that can be forwarded to out-of-order execution without having completed

● Out-of-order completion has to be tracked with a scoreboard or, more commonly today, with Tomasulo's algorithm

Static Scheduling at Compile Time

● VLIW/LIW or EPIC processors schedule the instructions at compile time.

● Advantages of compile-time scheduling

● The compiler has the code of the whole application in which to find independent instructions (instructions without dependencies)

● A long instruction word of mostly statically scheduled instructions can be created very effectively

● Effective usage of all processor units (the best case)

● Disadvantages

● Not all potential hazards can be resolved at compile time

● The compiler detects potential hazards, and hardware help is necessary to resolve them

– The processor has only a limited ability to resolve hazards
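
How a compiler might pack independent instructions into long instruction words can be sketched with a greedy scheduler. This is a simplified illustration, not the algorithm of any particular compiler: every instruction is placed into the earliest bundle that comes after all earlier instructions it conflicts with (RAW, WAR or WAW). The program and the bundle width of 2 are assumptions.

```python
def schedule_bundles(program, width):
    """Greedily pack instructions into bundles of at most `width` independent instructions."""
    bundle_of = []    # bundle index chosen for each instruction
    bundles = []      # bundles[b] lists the instruction indices placed in bundle b
    for i, (_, dest, srcs) in enumerate(program):
        earliest = 0
        for j in range(i):
            _, dest_j, srcs_j = program[j]
            conflict = dest_j in srcs or dest in srcs_j or dest == dest_j
            if conflict:
                earliest = max(earliest, bundle_of[j] + 1)
        b = earliest
        while b < len(bundles) and len(bundles[b]) >= width:
            b += 1    # bundle is full, try the next one
        while len(bundles) <= b:
            bundles.append([])
        bundles[b].append(i)
        bundle_of.append(b)
    return bundles

program = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r1", "r5")),   # depends on the add
    ("mul", "r6", ("r7", "r8")),   # independent
    ("or",  "r9", ("r6", "r4")),   # depends on the sub and the mul
]
bundles = schedule_bundles(program, width=2)
print(bundles)
```

The four instructions fit into three bundles because the independent multiply shares a bundle with the first add; the remaining slots stay empty, which is the unused capacity a static schedule cannot always avoid.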

Conclusion

● Introduction to describing computer architecture with the help of the ISA

● Unpipelined instruction execution on a processor with five independent execution stages

● Pipelined instruction execution on the reference processor with the simple 5-stage MIPS pipeline

● Classes of hazards and their possible solutions

● Extension of the 5-stage MIPS pipeline for multi-cycle instruction execution

● Introduction to super-scalar processors and to dynamic and static scheduling of independent instructions

