
  • Instruction Level Parallelism (ILP)
    Advanced Computer Architecture, CSE 8383, Spring 2004
    2/19/2004
    Presented by: Saad Al-Harbi, Saeed Abu Nimeh

  • Outline
    What's ILP?
    ILP vs parallel processing
    Sequential execution vs ILP execution
    Limitations of ILP
    ILP architectures: sequential, dependence, independence
    ILP scheduling
    Open problems
    References

  • What's ILP?
    Architectural technique that allows the overlap of individual machine operations (add, mul, load, store)
    Multiple operations execute in parallel (simultaneously)
    Goal: speed up execution
    Example:
    load R1, R2
    add R3, R3, 1
    add R3, R3, 1
    add R4, R3, R2
    add R4, R4, R2
    store [R4], R0

  • Example: Sequential vs ILP
    Sequential execution (without ILP):
    Add r1, r2 → r8 (4 cycles)
    Add r3, r4 → r7 (4 cycles)
    Total: 8 cycles

    ILP execution (overlapped):
    Add r1, r2 → r8
    Add r3, r4 → r7
    The second add issues one cycle after the first and overlaps with it, so both complete after a total of 5 cycles.
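    A minimal sketch of the cycle arithmetic above, assuming a machine that issues one operation per cycle and a uniform 4-cycle operation latency (both assumptions for illustration, not stated on the slide):

        # Hypothetical timing model: one issue per cycle, fixed op latency.
        def sequential_cycles(n_ops, latency=4):
            # Each operation runs to completion before the next one starts.
            return n_ops * latency

        def overlapped_cycles(n_ops, latency=4):
            # Operations issue back to back and overlap while in flight.
            return (n_ops - 1) + latency

        print(sequential_cycles(2))  # 8, matching the sequential example
        print(overlapped_cycles(2))  # 5, matching the ILP example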

  • ILP vs Parallel Processing
    ILP:
    Overlaps individual machine operations (add, mul, load) so that they execute in parallel
    Transparent to the user
    Goal: speed up execution

    Parallel processing:
    Separate processors work on separate chunks of the program (the processors are programmed to do so)
    Not transparent to the user
    Goals: speed-up and improved quality

  • ILP Challenges
    To achieve parallelism, the instructions executing in parallel must not have dependences among them:
    H/W terminology: data hazards (RAW, WAR, WAW)
    S/W terminology: data dependences
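    A minimal sketch of the three hazard classes named above (the register sets below are hand-encoded for the example):

        # Classify the hazards between instruction i (earlier) and j (later),
        # given the registers each one reads and writes.
        def hazards(i_reads, i_writes, j_reads, j_writes):
            found = []
            if i_writes & j_reads:
                found.append("RAW")  # j reads what i writes (true dependence)
            if i_reads & j_writes:
                found.append("WAR")  # j overwrites what i still needs to read
            if i_writes & j_writes:
                found.append("WAW")  # both write the same location
            return found

        # add r7,r4,r3 followed by div r7,r2,r8: both write r7
        print(hazards(i_reads={"r4", "r3"}, i_writes={"r7"},
                      j_reads={"r2", "r8"}, j_writes={"r7"}))  # ['WAW']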

  • Dependences and Hazards
    Dependences are a property of programs
    If two instructions are data dependent, they cannot execute simultaneously
    A dependence results in a hazard, and the hazard causes a stall
    Data dependences may occur through registers or memory

  • Types of Dependences
    Name dependences: output dependence, anti-dependence
    Data (true) dependence
    Control dependence
    Resource dependence

  • Name Dependences
    Output dependence (WAW): instructions i and j write the same register or memory location; the ordering must be preserved to leave the correct value in the register
    i: add r7,r4,r3
    j: div r7,r2,r8
    Anti-dependence (WAR): instruction j writes a register or memory location that instruction i reads
    i: add r6,r5,r4
    j: sub r5,r8,r11
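    Name dependences can be removed by register renaming, a standard technique the slides' hazard terminology hints at but does not cover; a minimal sketch (the instruction encoding and register pool are assumptions):

        # Rename destination registers so each write targets a fresh name,
        # eliminating WAW and WAR (name) dependences; RAW dependences remain.
        def rename(instrs):
            version = {}  # current physical name of each architectural register
            fresh = (f"p{n}" for n in range(1000))  # physical register pool
            out = []
            for dest, srcs in instrs:
                srcs = [version.get(s, s) for s in srcs]  # read current names
                version[dest] = next(fresh)               # fresh name per write
                out.append((version[dest], srcs))
            return out

        # add r7,r4,r3 ; div r7,r2,r8 -- the WAW on r7 disappears:
        print(rename([("r7", ["r4", "r3"]), ("r7", ["r2", "r8"])]))
        # [('p0', ['r4', 'r3']), ('p1', ['r2', 'r8'])]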

  • Data Dependences
    An instruction j is data dependent on instruction i if either of the following holds:
    instruction i produces a result that may be used by instruction j, or
    instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
    Example (its true-dependence chain is traced in the sketch below):
    LOOP: LD   F0, 0(R1)
          ADD  F4, F0, F2
          SD   F4, 0(R1)
          SUB  R1, R1, -8
          BNE  R1, R2, LOOP
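    A minimal sketch that mechanically recovers the true-dependence edges of this loop body; the read/write sets are hand-encoded from the code above:

        # Build RAW (true) dependence edges for the loop body.
        instrs = [
            ("LD",  {"F0"}, {"R1"}),        # (name, writes, reads)
            ("ADD", {"F4"}, {"F0", "F2"}),
            ("SD",  set(),  {"F4", "R1"}),  # the store writes memory only
            ("SUB", {"R1"}, {"R1"}),
            ("BNE", set(),  {"R1", "R2"}),
        ]
        edges = []
        for i, (ni, wi, ri) in enumerate(instrs):
            for nj, wj, rj in instrs[i + 1:]:
                if wi & rj:  # a later instruction reads what i wrote
                    edges.append((ni, nj))
        print(edges)  # [('LD', 'ADD'), ('ADD', 'SD'), ('SUB', 'BNE')]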

  • Control Dependences
    A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that instruction i is executed in correct program order.
    Example:
    if p1 { S1; }
    if p2 { S2; }
    Two constraints imposed by control dependences:
    An instruction that is control dependent on a branch cannot be moved before the branch
    An instruction that is not control dependent on a branch cannot be moved after the branch
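    A small concrete illustration of the first constraint (the guard and statement are invented for the example): hoisting S1 above its guarding branch changes behavior whenever the guard is false:

        def guarded(p1, x):
            # S1 is control dependent on the branch on p1.
            if p1:
                x = 10 // x  # S1: only safe once the guard has been checked
            return x

        print(guarded(False, 0))  # fine: returns 0 without executing S1
        # If S1 were hoisted above the branch, 10 // 0 would raise
        # ZeroDivisionError even though the original program is well defined.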

  • Resource Dependences
    An instruction is resource dependent on a previously issued instruction if it requires a hardware resource that is still being used by the previously issued instruction, e.g. two divides contending for one divide unit:
    div r1, r2, r3
    div r4, r2, r5
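    A minimal sketch of the resulting stall, assuming a single non-pipelined divider with a 10-cycle latency (both numbers are assumptions for illustration):

        # Track when the single divide unit becomes free again.
        DIV_LATENCY = 10
        div_free_at = 0
        for name, wants_issue in [("div r1,r2,r3", 0), ("div r4,r2,r5", 1)]:
            start = max(wants_issue, div_free_at)  # stall until the unit is free
            div_free_at = start + DIV_LATENCY
            print(f"{name}: starts cycle {start}, done cycle {div_free_at}")
        # div r1,r2,r3: starts cycle 0, done cycle 10
        # div r4,r2,r5: starts cycle 10, done cycle 20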

  • ILP Architectures
    Computer architecture: a contract (the instruction format and the interpretation of the bits that constitute an instruction) between the class of programs written for the architecture and the set of processor implementations of that architecture.
    ILP architectures add to this contract: information embedded in the program about the parallelism available between its instructions and operations.

  • ILP Architecture Classifications
    Sequential architectures: the program is not expected to convey any explicit information regarding parallelism (superscalar processors)
    Dependence architectures: the program explicitly indicates the dependences that exist between operations (dataflow processors)
    Independence architectures: the program provides information as to which operations are independent of one another (VLIW processors)

  • Sequential Architecture and Superscalar Processors
    The program contains no explicit information about the dependences that exist between instructions
    Dependences between instructions must be determined by the hardware
    It is only necessary to check dependences against sequentially preceding instructions that have been issued but not yet completed
    The compiler may reorder instructions to facilitate the hardware's task of extracting parallelism

  • Superscalar Processors
    Superscalar processors attempt to issue multiple instructions per cycle
    However, essential dependences are specified by the sequential ordering, so operations must be processed in sequential order
    This proves to be a performance bottleneck that is very expensive to overcome

  • Dependence Architecture and Dataflow Processors
    The compiler (or programmer) identifies the parallelism in the program and communicates it to the hardware by specifying the dependences between operations
    The hardware determines at run time when each operation is independent of the others and performs the scheduling
    No scanning of a sequential program is needed to determine dependences
    Objective: execute each instruction at the earliest possible time (when its input operands and a functional unit are available)

  • Dependence Architectures: Dataflow Processors
    Dataflow processors are representative of dependence architectures
    They execute an instruction at the earliest possible time, subject to the availability of input operands and functional units
    Dependences are communicated by providing with each instruction a list of all its successor instructions
    As soon as all input operands of an instruction are available, the hardware fetches the instruction
    The instruction is executed as soon as a functional unit is available
    Few dataflow processors currently exist

  • Dataflow Strengths and Limitations
    Dataflow processors use control parallelism alone to fully utilize the functional units
    A dataflow processor is more successful than other designs at looking far down the execution path to find control parallelism
    When successful, this is better than speculative execution:
    every instruction that is executed is useful, and
    the processor does not have to deal with the error conditions that speculative operations can raise

  • Independence Architecture and VLIW Processors
    By knowing which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle
    The set of independent operations is far larger than the set of dependent operations, so only a subset of the independences is specified
    The compiler may additionally specify on which functional unit and in which cycle an operation is executed
    The hardware then needs to make no run-time decisions

  • VLIW Processors
    Operation vs. instruction:
    Operation: a unit of computation (add, load, branch), the equivalent of an instruction in a sequential architecture
    Instruction: a set of operations intended to be issued simultaneously
    The compiler decides which operations go into each instruction (scheduling)
    All operations that are supposed to begin at the same time are packaged into a single VLIW instruction, as sketched below
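    A minimal sketch of such an instruction word with a hypothetical three-slot format (two integer ALU slots and one memory slot; the slot mix is an assumption, not something the slides specify):

        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class VLIWInstruction:
            # One operation per functional-unit slot; None means a no-op slot.
            alu0: Optional[str] = None
            alu1: Optional[str] = None
            mem: Optional[str] = None

        # The compiler packs independent operations into one wide instruction;
        # all of its slots begin execution in the same cycle.
        word = VLIWInstruction(alu0="add r3,r3,1", alu1="add r6,r5,r4",
                               mem="load r1,0(r2)")
        print(word)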

  • VLIW Strengths
    The hardware is very simple: a collection of functional units (adders, multipliers, branch units, etc.) connected by a bus, plus some registers and caches
    More silicon goes to actual processing, rather than being spent on branch prediction, for example
    It should run fast, as the only limit is the latency of the functional units themselves
    Programming a VLIW chip is very much like writing microcode

  • VLIW Limitations
    The need for a powerful compiler
    Increased code size arising from aggressive scheduling policies
    Larger memory bandwidth and register-file bandwidth
    Limitations due to lock-step operation
    Binary compatibility across implementations with varying numbers of functional units and latencies

  • Summary: ILP Architectures
    Sequential (superscalar): the hardware discovers dependences and schedules at run time
    Dependence (dataflow): the program expresses dependences; the hardware schedules
    Independence (VLIW): the compiler establishes independence and schedules; the hardware executes as specified

  • ILP Scheduling
    Static scheduling boosted by parallel code optimization:
    done by the compiler; the processor receives dependence-free, optimized code for parallel execution
    Typical of VLIWs and a few pipelined processors (e.g. MIPS)

    Dynamic scheduling without static parallel code optimization:
    done by the processor; the code is not optimized for parallel execution, and the processor detects and resolves dependences on its own
    Early ILP processors (e.g. CDC 6600, IBM 360/91)

    Dynamic scheduling boosted by static parallel code optimization:
    done by the processor in conjunction with a parallel optimizing compiler; the processor receives code optimized for parallel execution but still detects and resolves dependences on its own
    Usual practice for pipelined and superscalar processors (e.g. RS/6000)
    A minimal static-scheduling sketch follows this list.
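    A minimal sketch of the static (compile-time) case: a greedy list scheduler that packs operations into a cycle as soon as their true-dependence predecessors have completed. The two-op issue width and single-cycle latency are assumptions, and control and name dependences are ignored for brevity (a real scheduler would keep BNE last):

        # Greedy cycle-by-cycle list scheduling over a RAW dependence graph.
        def list_schedule(instrs, deps, width=2):
            # instrs: op names in program order; deps: {op: set of predecessors}
            done, cycles = set(), []
            while len(done) < len(instrs):
                ready = [i for i in instrs
                         if i not in done and deps.get(i, set()) <= done]
                slot = ready[:width]  # issue up to `width` ready ops per cycle
                cycles.append(slot)
                done |= set(slot)
            return cycles

        # The loop body from the data-dependence slide:
        instrs = ["LD", "ADD", "SD", "SUB", "BNE"]
        deps = {"ADD": {"LD"}, "SD": {"ADD"}, "BNE": {"SUB"}}
        print(list_schedule(instrs, deps))
        # [['LD', 'SUB'], ['ADD', 'BNE'], ['SD']]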

  • ILP Scheduling: Trace Scheduling
    An optimization technique that has been widely used for VLIW, superscalar, and pipelined processors. It selects a sequence of basic blocks as a trace and schedules the operations from the trace together.
    Example trace crossing a branch:
    Instr1
    Instr2
    Branch x
    Instr3
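    A minimal sketch of the trace-selection step, assuming edge-profile probabilities are available (the control-flow graph and probabilities are invented for illustration):

        # Grow a trace by repeatedly following the most likely successor block.
        def pick_trace(cfg, start):
            # cfg: {block: [(successor, probability), ...]}
            trace, block, seen = [start], start, {start}
            while cfg.get(block):
                succ, prob = max(cfg[block], key=lambda edge: edge[1])
                if succ in seen:  # stop rather than walk a cycle forever
                    break
                trace.append(succ)
                seen.add(succ)
                block = succ
            return trace

        cfg = {"B1": [("B2", 0.9), ("B3", 0.1)], "B2": [("B4", 1.0)]}
        print(pick_trace(cfg, "B1"))  # ['B1', 'B2', 'B4'], the hot path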

  • Trace Scheduling
    Extracts more ILP
    Increases machine fetch bandwidth by storing logically consecutive blocks in physically contiguous cache locations (making it possible to fetch multiple basic blocks in one cycle)
    Trace scheduling can be implemented in hardware or in software

  • Trace Scheduling in HW
    The hardware technique exploits the large amount of information available during dynamic execution to form traces dynamically and to schedule the instructions in a trace more efficiently.
    Since dependences and memory-access addresses have already been resolved during dynamic execution, the instructions in a trace can be reordered more easily and efficiently.
    Example: the trace cache approach

  • Trace Scheduling in SW
    A supplement for machines without hardware trace-scheduling support.
    Forms traces based on static profile data, and schedules instructions using traditional compiler scheduling and optimization techniques.
    It faces difficulties such as code explosion and exception handling.

  • ILP Open Problems
    Pipelined scheduling: optimized scheduling of pipelined behavioral descriptions; two simple types of pipelining (structural and functional)
    Controller cost: most scheduling algorithms do not consider controller cost, which depends directly on the controller style used during scheduling
    Area constraints: resource-constrained algorithms could allow better interaction between scheduling and floorplanning
    Realism: scheduling realistic design descriptions that contain several special language constructs; using more realistic libraries and cost functions; scheduling algorithms must also be expanded to incorporate different target architectures

  • References
    B. Ramakrishna Rau, Joseph A. Fisher. Instruction-Level Parallel Processing: History, Overview and Perspective. Journal of Supercomputing, Vol. 7, No. 1, Jan. 1993, pages 9-50.
    Monica S. Lam, Robert P. Wilson. Limits of Control Flow on Parallelism. 19th ISCA, May 1992, pages 19-21.
    Joseph A. Fisher. Global Code Generation for Instruction-Level Parallelism: Trace Scheduling-2. Technical Report HPL-93-43, HP Labs, Jun. 1993.
    VLIW at IBM Research: http://www.research.ibm.com/vliw
    Dick Pountain. Intel and HP hope to speed CPUs with VLIW technology that's riskier than RISC. http://www.byte.com/art/9604/sec8/art3.htm
    Hardware and Software Trace Scheduling: http://charlotte.ucsd.edu/users/yhu/paperlist/summary.html
    ILP open problems: http://www.ececs.uc.edu/~ddel/projects/dss/hls_paper/node9.html
    Hennessy & Patterson. Computer Architecture: A Quantitative Approach, 3rd edition, Morgan Kaufmann.