Instruction-Level Parallel Processors (Chapter 4)
TRANSCRIPT
What is instruction-level parallelism?
The potential for executing certain instructions in parallel, because they are independent.
Any technique for identifying and exploiting such opportunities.
Instruction-level parallelism (ILP)
Basic blocks: sequences of instructions that appear between branches; usually no more than 5 or 6 instructions!
Loops: for (i = 0; i < N; i++) x[i] = x[i] + s;
We can only realize ILP by finding sequences of independent instructions
Dependent instructions must be separated by a sufficient amount of time
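The distinction between independent and dependent instructions can be sketched in C (an illustrative sketch; the variable names are arbitrary):

```c
#include <assert.h>

/* Independent instructions (no shared results) may execute in parallel: */
int independent_pair(int a, int b, int c, int d, int *x, int *y) {
    *x = a + b;          /* these two statements share no operands,  */
    *y = c * d;          /* so an ILP processor can overlap them     */
    return *x + *y;
}

/* Dependent instructions must be separated in time: */
int dependent_pair(int a, int b) {
    int t = a + b;       /* produces t                               */
    return t * t;        /* consumes t: cannot start before t exists */
}
```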
Pipelined Operation
A number of functional units are employed in sequence to perform a single computation.
Each functional unit represents a certain stage of the computation.
Pipelining allows overlapped execution of instructions.
It increases the overall throughput of the processor.
Superscalar Processors
Superscalar processors increase the ability of the processor to exploit instruction-level parallelism.
Multiple instructions are issued every cycle, with multiple pipelines operating in parallel.
The processor receives a sequential stream of instructions; the decode and execute unit then issues multiple instructions to the multiple execution units in each cycle.
VLIW architecture
VLIW stands for Very Long Instruction Word.
VLIW architecture receives multi-operation instructions, i.e. with multiple fields for the control of EUs.
The basic structure of superscalar and VLIW processors is the same: multiple EUs, each capable of parallel execution on data fetched from a register file.
Dependencies Between Instructions
Dependences (constraints):
Data dependences: arise from precedence requirements concerning referenced data
Control dependences: arise from conditional statements
Resource dependences: arise from limited resources
Data Dependency
An instruction j depends on data from a previous instruction i: j cannot execute before i, and j and i cannot execute simultaneously.
Data dependences are properties of the program; whether a dependence leads to a data hazard or a stall in a pipeline depends on the pipeline organization.
Data Dependency
Data dependencies can be differentiated according to the data involved and according to their type. The data involved in a dependency may come from a register or from memory; the type of data dependency may arise either in straight-line code or in a loop.
Data Dependency in Straight-Line Code
Straight-line code may contain three different types of dependencies: RAW (read after write), WAR (write after read), WAW (write after write).
RAW (Read after Write)
i1: load r1, a;
i2: add r2, r1, r1;
Assume a pipeline of Fetch/Decode/Execute/Mem/Writeback
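Assuming this five-stage pipeline, with the register file written in the first half of a cycle and read in the second half (a common textbook assumption), the stall count for the RAW pair can be sketched as:

```c
#include <assert.h>

/* Hypothetical 5-stage pipeline: Fetch=0, Decode=1, Execute=2, Mem=3,
   Writeback=4. Without forwarding, i2 can read r1 in its Decode stage
   only once i1 has written it back. This helper returns the number of
   stall cycles between a producer writing in stage wb_stage and an
   immediately following consumer reading in stage rd_stage, assuming
   one instruction enters the pipeline per cycle. */
int raw_stalls(int wb_stage, int rd_stage) {
    int gap = wb_stage - rd_stage - 1;  /* cycles the consumer waits */
    return gap > 0 ? gap : 0;
}
```

For the load/add pair above, raw_stalls(4, 1) gives two bubbles; forwarding would shrink or remove them.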
Name Dependencies
A “name dependence” occurs when 2 instructions use the same register or memory location (a name) but there is no flow of data between the 2 instructions
There are 2 types:
Antidependencies: occur when an instruction j writes a register or memory location that instruction i reads, and i is executed first. Corresponds to a WAR hazard.
Output dependencies: Occur when instruction i and instruction j write the same register or memory location
Protected against by checking for WAW hazards
Write after Read (WAR)
i1: mul r1, r2, r3; r1 <= r2 * r3
i2: add r2, r4, r5; r2 <= r4 + r5
If instruction i2 (add) is executed before instruction i1 (mul) for some reason, then i1 (mul) could read the wrong value for r2.
Write after Read (WAR)
One reason for delaying i1 would be a stall for the ‘r3’ value being produced by a previous instruction. Instruction i2 could proceed because it has all its operands, thus causing the WAR hazard.
Use register renaming to eliminate WAR dependency. Replace r2 with some other register that has not been used yet.
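Renaming can be sketched in C, modeling the register file as a small array (an illustrative model only; r6 stands in for "some unused register"):

```c
#include <assert.h>

/* WAR hazard from the slides:
     i1: mul r1, r2, r3;   reads r2
     i2: add r2, r4, r5;   writes r2
   If i2 completes before i1 reads r2, i1 sees the wrong value.
   Renaming gives i2 a fresh destination, removing the dependency. */
enum { R1, R2, R3, R4, R5, R6, NREGS };

void war_renamed(int r[NREGS]) {
    r[R6] = r[R4] + r[R5];   /* i2 renamed: add r6, r4, r5; may run first */
    r[R1] = r[R2] * r[R3];   /* i1: mul r1, r2, r3; still sees old r2     */
    r[R2] = r[R6];           /* later uses of r2 read the renamed value   */
}
```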
Write after Write (WAW)
i1: mul r1, r2, r3; r1 <= r2 * r3
i2: add r1, r4, r5; r1 <= r4 + r5
If instruction i1 (mul) finishes AFTER instruction i2 (add), then register r1 would get the wrong value. Instruction i1 could finish after instruction i2 if separate execution units were used for instructions i1 and i2.
Write after Write (WAW)
One way to solve this hazard is to simply let instruction i1 proceed normally, but disable its write stage.
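Suppressing the stale write can be sketched as follows (a hypothetical model, not a real scoreboard: each in-flight write carries a sequence number, and a write commits only if no younger write to the same register has already committed):

```c
#include <assert.h>

/* A register with the sequence number of the last committed write. */
typedef struct { int value; int last_seq; } Reg;

/* Commit a write of `value` issued with sequence number `seq`; older
   (stale) writes arriving late are disabled, resolving the WAW hazard. */
void commit_write(Reg *r, int value, int seq) {
    if (seq >= r->last_seq) {
        r->value = value;
        r->last_seq = seq;
    }
}
```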
Data Dependency in Loops
Instructions belonging to a particular loop iteration may be dependent on instructions belonging to previous loop iterations.
This type of dependency is referred to as a recurrence or inter-iteration data dependency.
do
  X(I) = A*X(I-1) + B
end do
Data Dependency in Loops
Loops are a "common case" in pretty much any program, so this is worth mentioning. Consider:
for (j = 0; j <= 100; j++) {
  A[j+1] = A[j] + C[j];   /* S1 */
  B[j+1] = B[j] + A[j+1]; /* S2 */
}
Data Dependency in Loops
Now look at the dependence of S1 on an earlier iteration of S1. This is a loop-carried dependence; in other words, the dependence exists between different iterations of the loop.
Successive iterations of S1 must execute in order.
The dependence of S2 on S1 is within an iteration and is not loop-carried: if it were the only dependence, multiple iterations of the loop could execute in parallel.
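The contrast between a loop-carried dependence and independent iterations can be sketched in C (illustrative functions; N is an arbitrary size):

```c
#include <assert.h>
#define N 8

/* Loop-carried dependence, as in S1 above: iteration j needs the value
   written by iteration j-1, so successive iterations must run in order. */
void loop_carried(int A[N + 1], const int C[N]) {
    for (int j = 0; j < N; j++)
        A[j + 1] = A[j] + C[j];
}

/* No loop-carried dependence: each iteration touches only its own
   element, so the iterations could execute in parallel. */
void loop_independent(int A[N], int s) {
    for (int j = 0; j < N; j++)
        A[j] = A[j] + s;
}
```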
Data Dependency Graphs
Data dependency can also be represented by graphs: the nodes are instructions, and the directed edges are labelled t, a or o to denote RAW, WAR or WAW dependencies respectively. An edge from i1 to i2 means that i2 is dependent on i1.
i1: load r1, a;
i2: load r2, b;
i3: add r3, r1, r2;
i4: mul r1, r2, r4;
i5: div r1, r2, r4;
Instruction interpretation: r3 <= (r1) + (r2), etc.
In this example, i3 is data (RAW) dependent on i1 and i2, i4 is anti (WAR) dependent on i3, and i5 is output (WAW) dependent on i4.
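The dependence tests behind such a graph can be sketched in C. The three-address instruction format and register numbering here are illustrative assumptions, not a real ISA; a source field of -1 means "no source register":

```c
#include <assert.h>

enum { NONE = 0, RAW = 1, WAR = 2, WAW = 4 };

typedef struct { int dest, src1, src2; } Instr;

/* Classify the dependence of a later instruction j on an earlier i. */
int classify(Instr i, Instr j) {
    int dep = NONE;
    if (j.src1 == i.dest || j.src2 == i.dest) dep |= RAW; /* j reads what i writes  */
    if (j.dest == i.src1 || j.dest == i.src2) dep |= WAR; /* j writes what i reads  */
    if (j.dest == i.dest)                     dep |= WAW; /* both write same reg    */
    return dep;
}
```

Applied to the slide's example, i1 to i3 is RAW, i3 to i4 is WAR, and i4 to i5 is WAW.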
Control Dependence
Control dependence determines the ordering of an instruction with respect to a branch instruction: if the branch is taken, the instruction is executed; if the branch is not taken, the instruction is not executed.
An instruction that is control dependent on a branch cannot be moved before the branch: instructions from the then part of an if-statement cannot be executed before the branch.
Control Dependence
An instruction that is not control dependent on a branch cannot be moved after the branch
Other instructions cannot be moved into the then part of an if-statement.
Example:
sub r1, r2, r3;
jz zproc;
mul r4, r1, r1;
:
:
zproc: load r1, x;
:
Control Dependence
Published branch statistics:

Source                   Processor   Branch ratio (%)     Cond. branch ratio (%)   Uncond. branch ratio (%)
                                     gen.purp.  scient.   gen.purp.  scient.       gen.purp.  scient.
Yeh, Patt (1992)         M88100      24         5         78         82            20         4
Stephens et al. (1991)   RS/6000     22         11        83         85            18         9
Lee, Smith (1984)        IBM/370     29         10        73         73            21         8
                         PDP-11/70   39         -         46         -             18         -
                         CDC 6400    8          -         53         -             4          -
Control Dependence
Frequent conditional branches impose a heavy performance constraint on ILP-processors.
A higher rate of instruction issue per cycle raises the probability of encountering a conditional control dependency in each cycle.
Control Dependence
[Figure: occurrence of conditional branches (JC) in the instruction stream over time (t), comparing scalar issue with superscalar (or VLIW) issue of 2, 3 and 6 instructions per issue: the wider the issue, the more likely each issue slot contains a conditional branch. JC: conditional branch]
Control Dependency Graph
[Figure: control dependency graph with nodes i0 to i8; the branch nodes i3 and i4 each have T (true) and F (false) outgoing edges leading to i5, i6 and i7, converging at i8]
i0: r1 = op1;
i1: r2 = op2;
i2: r3 = op3;
i3: if (r2 > r1)
i4:   { if (r3 > r1)
i5:       r4 = r3;
i6:     else r4 = r1; }
i7: else r4 = r2;
i8: r5 = r4 * r4;
Figure 5.2.11: An example for a CDG
Resource Dependency
An instruction is resource dependent on a previously issued instruction if it requires a hardware resource which is still being used by the previously issued instruction.
If, for instance, there is only a single division unit available, then in the code sequence
i1: div r1, r2, r3;
i2: div r4, r2, r5;
i2 is resource dependent on i1.
Instruction Scheduling
Why is instruction scheduling needed? Instruction scheduling involves:
Detection: detecting where dependencies occur in the code.
Resolution: removing dependencies from the code.
There are two approaches to instruction scheduling: the static approach and the dynamic approach.
Static Scheduling
Detection and resolution are accomplished by the compiler, which avoids dependencies by reordering the code.
VLIW processors expect dependency-free code generated by an ILP compiler.
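Static scheduling can be sketched at the source level: the compiler hides the latency of a dependent pair by moving an independent instruction into the gap. Here the "schedule" is simply the order of C statements (an illustrative sketch; the variables stand in for registers):

```c
#include <assert.h>

int scheduled(int a, int b, int c, int d) {
    int r1 = a * b;      /* producer (long-latency multiply)        */
    int r5 = c + d;      /* independent work moved into the gap     */
    int r2 = r1 + 1;     /* consumer issues later: fewer stalls     */
    return r2 + r5;
}
```

The result is the same as with the consumer immediately after the producer; only the issue order, and hence the stall behaviour, changes.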
Dynamic Scheduling
Performed by the processor, which maintains two windows.
Issue window: contains all fetched instructions which are intended for issue in the next cycle. The issue window's width is equal to the issue rate. All instructions in the window are checked for dependencies.
Dynamic Scheduling
Execution window: contains all those instructions which are still in execution and whose results have not yet been produced.
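A minimal issue-window dependency check can be sketched as follows (a hypothetical model: an instruction may issue only if none of its registers conflicts with the destination of an instruction still in the execution window, covering RAW and WAW):

```c
#include <assert.h>

typedef struct { int dest, src1, src2; } Instr;

/* Return 1 if `in` may issue, 0 if it must stall behind an unfinished
   instruction in the execution window. */
int may_issue(Instr in, const Instr *exec_window, int n) {
    for (int k = 0; k < n; k++) {
        int d = exec_window[k].dest;
        if (in.src1 == d || in.src2 == d || in.dest == d)
            return 0;   /* dependent on an unfinished result: stall */
    }
    return 1;
}
```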
Instruction Scheduling in ILP-processors
Detection and resolution of dependencies
Parallel optimization
ILP-instruction scheduling
Parallel Optimization
Parallel optimization is achieved by reordering the sequence of instructions, using appropriate code transformations, for parallel execution.
Also known as code restructuring or code reorganization.
Preserving Sequential Consistency
To maintain the logical integrity of the program:
div r1, r2, r3;
add r5, r6, r7;
jz anywhere;