
CS8803: Compilers for Embedded Systems
Santosh Pande – Summer 2007

Chapter 8: Compiling for VLIWs and ILP

1

Outline

• 8.1 Profiling
• 8.2 Scheduling
  – Acyclic Region Types and Shapes
  – Region Formation
  – Schedule Construction
  – Resource Management During Scheduling
  – Loop Scheduling
  – Clustering
• 8.3 Register Allocation
• 8.4 Speculation and Predication
• 8.5 Instruction Selection

2

Overview

• This chapter…
  – Focuses on optimizations, or code transformations
  – These topics are common across all types of ILP processors, for both general-purpose and embedded applications
  – Compilers and toolchains used for embedded processors are very similar to those used for general-purpose computers

3

1. Profiling

• Profiles
  – Statistics about how a program spends its time and resources
  – Many ILP optimizations require good profile information
• Two types of profiles
  – “Point profiles”
    • Call graphs and CFGs
  – “Path profiles”

4

Types of Profiles

• Call graph
  – Nodes: procedures
  – Edges: procedure calls
  – Information
    • How many times was each procedure called?
    • How many times did each caller invoke a callee?
  – Limitation
    • Cannot tell what to do within the possibly beneficial procedures themselves

5

Types of Profiles (cont.)

• Control Flow Graph (CFG)
  – Nodes: basic blocks
    • Basic block: a sequence of instructions that always execute together
  – Edges: one basic block can execute after another
  – Information
    • How many times was a particular basic block executed?
    • How many times did control flow from one basic block to one of its immediate neighbors?

6

7

Call Graph

Control Flow Graph

Types of Profiles (cont.)

• Path profiles
  – Measure the number of times a path, or sequence of contiguous blocks in the CFG, is executed
  – Optimizations using path profiles have appeared in research compilers, but not in production compilers
  – Note that call graph and CFG profiles are “point profiles”

8

Profile Collection

• Instrumentation (a small sketch follows after this slide)
  – Extra code is inserted into the program to gather data
  – Can be done by the compiler or by a post-compilation tool
    • e.g. Pin: a dynamic instrumentation tool and API
      – http://rogue.colorado.edu/pin/
  – Hardware techniques
    • Special registers record statistics about various events
    • Statistical-sampling profilers

9
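
A minimal C sketch of what compiler-inserted instrumentation amounts to, assuming the compiler assigns each basic block an index into a per-block counter array; the function name, block numbering, and counter array are made up for illustration.

#include <stdio.h>

static unsigned long bb_count[3];          /* one counter per basic block */

int abs_val(int x)
{
    bb_count[0]++;                         /* entry block  */
    if (x < 0) {
        bb_count[1]++;                     /* taken path   */
        x = -x;
    }
    bb_count[2]++;                         /* join block   */
    return x;
}

int main(void)
{
    abs_val(-5);
    abs_val(3);
    for (int i = 0; i < 3; i++)
        printf("BB%d executed %lu times\n", i, bb_count[i]);
    return 0;
}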

Synthetic Profiles (Heuristics in Lieu of Profiles)

• Synthetic profile
  – Assigns weights to each part of the program based solely on the structure of the source program
  – Pros
    • No need to collect statistics on actual running programs
  – Cons
    • Cannot see how the program behaves with real data
  – No synthetic profiling technique does as well as actual profiling

10

2. Scheduling

• Instruction scheduling
  – Directly responsible for identifying and grouping operations that can be executed in parallel
• Taxonomy
  – Cyclic: operates on loops in the program
  – Acyclic: handles loop-free regions, not loops directly
  – Current compilers include both kinds of schedulers
• Hardware support
  – Affects the choices available to the scheduler

11

12

Acyclic Region Types and Shapes

• Shapes of regions
  – Basic blocks, traces, …
• Basic blocks
  – A “degenerate” form of region
  – Maximal straight-line code fragments

13

Acyclic Region Types and Shapes (cont.)

• Traces: the first proposed region shape
  – Linear paths through the code: multiple entrances and multiple exits
  – A trace consists of the operations from a list of basic blocks with the following properties:
    • Each basic block is a predecessor of the next on the list
      – e.g. Bk falls through or branches to Bk+1
    • For any i and k, there is no path Bi -> Bk -> Bi except for those that go through B0
      – i.e. the code is cycle-free, although the entire region can be part of some encompassing loop
  – Allows forward branches and so on: complex!

14

Acyclic Region Types and Shapes (cont.)

15

Figure: a trace is a linear, multiple-entry, multiple-exit region of basic blocks joined by control flow; a side entrance is a branch into the middle of the trace

Acyclic Region Types and Shapes (cont.)

• Superblocks
  – Traces with an added restriction
    • Single-entry, multiple-exit traces
  – Same properties as traces, but with one addition
    • There may be no branches into a block in the region, except to B0. These outlawed branches are referred to in the superblock literature as side entrances
  – Tail duplication: a region-enlarging technique
    • Avoids side entrances, adding compensation code

16

Acyclic Region Types and Shapes (cont.)

17

Figure: tail duplication to eliminate side entrances, forming a superblock (e.g. an edge weight of 70*0.8 = 56)

Acyclic Region Types and Shapes (cont.)

• Hyperblocks
  – Single-entry, multiple-exit regions with internal control flow
  – Variants of superblocks that employ predication to fold multiple control paths into a single superblock
  – Removing some control-flow complexity

18

Acyclic Region Types and Shapes (cont.)

19

Figure: a hyperblock formed by if-conversion of basic blocks B2 and B5

Acyclic Region Types and Shapes (cont.)

• Treegions
  – Regions containing a tree of basic blocks within the control flow of the program
  – Properties
    • Each basic block Bj except for B0 has exactly one predecessor
    • That predecessor, Bi, is on the list, where i < j
  – Any path through a treegion yields a superblock
    • A trace with no side entrances
  – Treegion-2: without the restriction on side entrances

20

Acyclic Region Types and Shapes (cont.)

21

Figure: an example CFG partitioned into Treegion 1, Treegion 2, and Treegion 3

Acyclic Region Types and Shapes (cont.)

• Percolation scheduling
  – Many code-motion rules are applied to regions that resemble traces
  – One of the earliest versions of DAG scheduling
    • DAG scheduling: the most general form of acyclic scheduling
• Cyclic schedulers
  – Limited region shapes
    • A single innermost loop
    • An inner loop that has very simple control flow

22

Acyclic Region Types and Shapes (cont.)

23

Region Formation

• So far, we have discussed region shapes
• Remaining two questions
  – Region formation
    • How does one divide a program into regions?
    • Region formation is more than selecting good regions from the CFG; it also includes duplication (region enlargement)
  – Schedule construction
    • How does one build schedules for the regions?
    • Well-selected regions are critical for schedule construction
      – Using profiles: how frequently is each region executed?

24

Region Formation (cont.)

• Region selection
  – Trace growing
    • The most popular algorithm
    • Uses the mutual-most-likely heuristic
  – Steps
    • A is the last block of the current trace
    • Block B is A’s most likely successor, and vice versa
      – A and B are “mutually most likely”
    • Add B to the trace
    • Repeat until there is no mutually-most-likely successor (a small sketch follows after this slide)

25
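
A small C sketch of mutual-most-likely trace growing over a profiled CFG; the edge-count matrix and the four-block graph are made-up illustrative data, not from any real program.

#include <stdio.h>

#define NBLOCKS 4

/* edge[i][j] = profiled count of transitions Bi -> Bj */
static const long edge[NBLOCKS][NBLOCKS] = {
    /* B0 */ {0, 90, 10,  0},
    /* B1 */ {0,  0,  0, 90},
    /* B2 */ {0,  0,  0, 10},
    /* B3 */ {0,  0,  0,  0},
};

static int most_likely_succ(int b) {
    int best = -1;
    for (int s = 0; s < NBLOCKS; s++)
        if (edge[b][s] > 0 && (best < 0 || edge[b][s] > edge[b][best]))
            best = s;
    return best;
}

static int most_likely_pred(int b) {
    int best = -1;
    for (int p = 0; p < NBLOCKS; p++)
        if (edge[p][b] > 0 && (best < 0 || edge[p][b] > edge[best][b]))
            best = p;
    return best;
}

int main(void) {
    int cur = 0;                            /* seed the trace at B0 */
    printf("trace: B%d", cur);
    for (;;) {
        int s = most_likely_succ(cur);
        /* extend only if cur and s are mutually most likely */
        if (s < 0 || most_likely_pred(s) != cur)
            break;
        printf(" -> B%d", s);
        cur = s;
    }
    printf("\n");
    return 0;
}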

Region Formation (cont.)

• Region selection
  – Shortcomings of using point profiles
    • Cumulative effect of conditional probability
    • Point profiles measure each branch probability independently
    • The probability of remaining on the trace rapidly decreases
    • Example:
      – A trace that crosses ten splits, each with a 90% chance of staying on the trace, has only about a 35% (0.9^10 ≈ 0.35) probability of running from start to end
    • Solutions
      – Building different-shaped regions
      – Using predication to remove branches

26

Region Formation (cont.)

• Region selection
  – Hyperblock formation
    • Based on mutual-most-likely trace formation
    • Considers block size and execution frequency
    • Predication can remove unpredictable branches
  – Research on better statistics
    • Using global, bounded-length path profiles to improve static branch prediction

27

Region Formation (cont.)

• Enlargement techniques
  – Region selection alone is not enough
  – Need to increase ILP by using enlargement
    • Code size increases, but the code schedules better
    • Based on the fact that programs iterate (loop)
  – Loop unrolling
    • Performed before region selection to make the larger unrolled code available to the region selector
    • Induction variable simplification, etc., performed to expose more parallelism across iterations

28

Region Formation (cont.)

29

• Simplified example of variants of loop unrolling (figure)
  – For a while loop: the most general case
  – For a for loop: counted loops
  – A small sketch for the counted-loop case follows below
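
A sketch of 4x unrolling for a counted (for) loop, with a cleanup loop for trip counts that are not a multiple of 4; for a general while loop the compiler would instead replicate the body together with its exit test in each copy. The array and bounds are illustrative.

#include <stdio.h>

#define N 10

int main(void) {
    int a[N], i;

    /* unrolled main loop: four copies of the body per iteration */
    for (i = 0; i + 3 < N; i += 4) {
        a[i]     = i;
        a[i + 1] = i + 1;
        a[i + 2] = i + 2;
        a[i + 3] = i + 3;
    }
    /* cleanup (remainder) loop for the leftover iterations */
    for (; i < N; i++)
        a[i] = i;

    for (i = 0; i < N; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}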

Region Formation (cont.)

• Induction variable manipulations for loops (figure; a small sketch follows below)

30
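
A small illustration of the induction-variable rewriting that usually accompanies unrolling: the four serial i++ updates of the unrolled copies are replaced by fixed offsets from a single i, so the copies no longer chain through the induction variable and can be scheduled in parallel. Names and sizes are illustrative.

#include <stdio.h>

#define N 8

int main(void) {
    int a[N], s = 0, i;
    for (i = 0; i < N; i++)
        a[i] = i + 1;

    /* before rewriting, each unrolled copy did: s += a[i]; i++;
     * after rewriting, all four loads use offsets from one i */
    for (i = 0; i + 3 < N; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < N; i++)
        s += a[i];

    printf("sum = %d\n", s);
    return 0;
}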

Region Formation (cont.)

• Enlargement techniques
  – Different approaches for superblocks
    • Superblock loop unrolling
      – Unrolls superblock loops (where the most likely exit from the superblock jumps back to its beginning)
    • Superblock loop peeling
      – Used when the profile suggests a small number of iterations for the superblock loop
      – The expected number of iterations is copied (peeled)
    • Superblock target expansion
      – Similar to the mutual-most-likely heuristic for growing traces
      – If superblock A ends in a likely branch to B, then B is added

31

32

Figure: superblock-enlarging optimizations: target expansion, loop unrolling, and loop peeling

Region Formation (cont.)

• Phase-ordering considerations
  – Which comes first?
    • Multiflow compiler: enlargement before trace selection
    • Superblock-based compilers chose and formed superblocks first
    • Neither is clearly preferable
  – Other transformations
    • e.g. dependence height reduction should be run before region formation

33

Schedule Construction

• So far, we have discussed region formation
  – Selecting and enlarging individual regions
• A schedule
  – A set of annotations that indicate the unit assignment and cycle time of the operations in a region
  – Depends on the shape of the region
• Goal: minimizing an objective function
  – Estimated completion time, plus code size or energy efficiency (in embedded systems)

34

Schedule Construction (cont.)

• Analyzing programs for schedule construction
  – Dependences (data and control) prohibit reordering
    • Impose a partial ordering on the pieces of code
    • Represented as a DAG or its variants
      – DDG (data dependence graph)
      – PDG (program dependence graph)
    • Creating a DDG or PDG is typically O(n^2)
      – where n is the number of operations

35

36

Figure: data dependences example, showing a true (flow) dependence and an output dependence; a small C illustration follows below
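
A tiny C illustration of the dependence kinds the figure refers to; the specific statements are made up.

#include <stdio.h>

int main(void) {
    int x, y;
    x = 1;          /* op1: writes x                                    */
    y = x + 2;      /* op2: reads x  -> true (flow) dependence on op1   */
    x = 5;          /* op3: writes x -> output dependence on op1,       */
                    /*                  anti dependence on op2          */
    printf("x=%d y=%d\n", x, y);
    return 0;
}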

37

Figure: control dependence and control flow example

Schedule Construction (cont.)

• Compaction techniques
  – Cycle versus operation scheduling
    • Two strategies to minimize an objective function
    • 1) Operation scheduling
      – Selects an operation in the region and places it in the “best” cycle without violating dependences
    • 2) Cycle scheduling
      – Fills a cycle with operations from the region, proceeding to the next cycle only after exhausting the available operations
    • Operation scheduling is theoretically more powerful because it can give special consideration to long-latency operations

38

Schedule Construction (cont.)

• Compaction techniques
  – Linear techniques
    • Algorithms that use the full DDG are O(n^2); in practice, linear O(n) techniques are used in modern compilers
    • Two techniques
    • 1) As-soon-as-possible (ASAP) scheduling
      – Places each op in the earliest possible cycle (top-down linear scan)
    • 2) As-late-as-possible (ALAP) scheduling
      – Places each op in the latest possible cycle (bottom-up linear scan)
    • Example: critical-path scheduling uses ASAP followed by ALAP to identify the operations on the critical path

39

Schedule Construction (cont.)

• Compaction techniques
  – Graph-based techniques (list scheduling)
    • Linear techniques cannot see the global properties of the DDG
    • Repeatedly assigns a cycle to each operation without backtracking (a greedy algorithm): O(n log n)
    • Steps (a small sketch follows after this slide)
      – Select an operation from a data-ready queue (DRQ)
      – An op is ready when all of its DDG predecessors have been scheduled
      – Once scheduled, the op is removed from the DRQ
    • Performance depends on the order in which candidates are selected, i.e., on the scheduler’s greediness

40
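
A minimal cycle-by-cycle greedy list-scheduling sketch over a small hand-coded DDG; the dependence matrix, latencies, and 2-issue width are illustrative, and the inline ready test plays the role of the data-ready queue.

#include <stdio.h>

#define NOPS  5
#define WIDTH 2                         /* issue slots per cycle */

static const int dep[NOPS][NOPS] = {    /* dep[i][j] = 1: op j depends on op i */
    {0,1,1,0,0},
    {0,0,0,1,0},
    {0,0,0,1,0},
    {0,0,0,0,1},
    {0,0,0,0,0},
};
static const int latency[NOPS] = {2, 1, 1, 2, 1};

int main(void) {
    int sched[NOPS];                    /* cycle assigned to each op, -1 = unscheduled */
    int done = 0;
    for (int i = 0; i < NOPS; i++)
        sched[i] = -1;

    for (int cycle = 0; done < NOPS; cycle++) {
        int issued = 0;
        for (int op = 0; op < NOPS && issued < WIDTH; op++) {
            if (sched[op] >= 0)
                continue;
            int ready = 1;              /* all predecessors finished by this cycle? */
            for (int p = 0; p < NOPS; p++)
                if (dep[p][op] &&
                    (sched[p] < 0 || sched[p] + latency[p] > cycle))
                    ready = 0;
            if (ready) {                /* greedy: place it, never backtrack */
                sched[op] = cycle;
                issued++;
                done++;
            }
        }
    }
    for (int op = 0; op < NOPS; op++)
        printf("op%d -> cycle %d\n", op, sched[op]);
    return 0;
}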

Schedule Construction (cont.)

• Compensation Code
  – Restoring the correct flow of data and control
  – Four basic scenarios

41

• (a) No compensation
  – The code motion does not change the relative order of operations w.r.t. joins and splits
  – Also covers moving operations above a split point (becoming speculative)
  – Recall that compensation code for speculative code motion depends on the recovery model

Schedule Construction (cont.)

• Compensation Code

42

• (b) Join compensation
  – B moves above a join point A
  – Drop a copy of B (B’) in the join path
• (c) Split compensation
  – A split op B (i.e., a branch) moves above a previous op A
  – Produces a copy of A (A’) in the split path

Schedule Construction (cont.)

• Compensation Code

• Summary
  – In general, make sure to preserve all paths from the original sequence in the transformed control flow after scheduling

43

• (d) Join-split compensation
  – Splits moved above joins (in the figure)
  – Splits moved above splits

Z-B-W path

Resource Management During Scheduling

• Resource hazards
  – Arise from dependences, operation latencies, and the available resources (i.e., functional units)
• Approaches
  – Reservation tables: a simple and early method
  – Finite-state automata

44

Resource Management During Scheduling (cont.)

• Resource vectors
  – Make scheduling of instructions easy
  – Row: each cycle of the schedule
  – Column: each resource in the machine
  – Recent work on reducing their size

45

Figure: reservation table with busy slots marked

Resource Management During Scheduling (cont.)

• Finite-state automata
  – Intuition
    • Is this instruction sequence a resource-legal schedule?
      – Analogous to “Does this FSA accept this string?”
    • A schedule is a sequence of instructions
      – Analogous to “a string is a sequence of alphabet characters”
      – The resource-valid schedules form a language
  – FSAs are sufficient to accept this language
  – Several approaches for improving efficiency
    • Breaking them into “factor” automata, reversing automata, and nondeterminism

46

Resource Management During Scheduling (cont.)

• Finite-state Automata

47

• Original automaton: represents a two-resource machine
• Factored automata: “Letter” and “Number”, since the operations are independent
• The cross-product of the factored automata is equivalent to the original automaton

Resource Management During Scheduling (cont.)

• TODO:
  – Reverse automata?
  – Nondeterminism?

48

Loop Scheduling

• Loop scheduling approaches
  – Most execution time is spent in loops
  – The simplest approach is loop unrolling
  – Software pipelining
    • Exploits inter-iteration ILP: parallelism across iterations
    • Modulo scheduling
      – Produces a kernel of code
      – Kernel: overlapped multiple iterations of a loop, with neither data-dependence violations nor resource conflicts
    • Prologue and epilogue code is needed for correctness
      – Increases code size; hardware techniques can reduce this

49

• Conceptual illustration of software pipelining

Loop Scheduling

50

Loop Scheduling (cont.)

• Modulo scheduling
  – Initiation interval (II)
    • The length of the kernel: the constant interval between the starts of successive kernel iterations
  – Minimum II (MII)
    • A lower bound on the II
  – Two constraints determine the MII
    • Recurrence-constrained minimum II (RecMII)
    • Resource-constrained minimum II (ResMII)

51

Loop Scheduling (cont.)

• Modulo scheduling
  – Goal
    • Arrange operations so that they can be repeated at the smallest possible II distance (related to throughput)
      – Rather than minimizing the stage count of each iteration, which would minimize latency
      – But the stage count is also important because it relates to the prologue (pipeline filling) and the epilogue (pipeline draining)
  – Downsides of modulo scheduling
    • Hard to handle nested loops
    • Control flow in the loop can be handled only through predication

52

• Conceptual model of modulo scheduling
  – 4-wide machine; load (3 cycles), multiply and compare (2 cycles)

53

How many inter-iteration dependences?

Loop Scheduling (cont.)

• Modulo scheduling
  – Modulo reservation table (MRT) (a small sketch follows below)
    • Find a resource-conflict-free schedule over multiple II intervals
    • Ensure the same resource is not used more than once in the same cycle
    • The MRT records and checks resource usage for each cycle modulo II

54

Modulo Reservation Table

55
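
A minimal sketch of the MRT check, assuming a two-resource machine and II = 3; the reserve() helper and the resource indices are made up for illustration. An op that uses a resource at cycle c occupies MRT row (c mod II), so two uses of the same resource may not map to the same row.

#include <stdio.h>

#define II   3
#define NRES 2                              /* e.g. 0 = memory, 1 = ALU */

static int mrt[II][NRES];                   /* 1 = slot already taken */

/* try to reserve resource 'res' at absolute cycle 'cycle'; 1 on success */
static int reserve(int cycle, int res) {
    int row = cycle % II;
    if (mrt[row][res])
        return 0;                           /* modulo resource conflict */
    mrt[row][res] = 1;
    return 1;
}

int main(void) {
    printf("load @0 on mem: %s\n", reserve(0, 0) ? "ok" : "conflict");
    printf("add  @1 on alu: %s\n", reserve(1, 1) ? "ok" : "conflict");
    /* cycle 3 maps to row 3 %% II == 0, already holding the first load */
    printf("load @3 on mem: %s\n", reserve(3, 0) ? "ok" : "conflict");
    return 0;
}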

Loop Scheduling (cont.)

• Modulo scheduling
  – Searching for the II
    • Find two bounds: minII and maxII
    • maxII: trivial, the sum of the latencies of all operations in the loop
    • minII: complex, max(ResMII, RecMII) (a small sketch of these bounds follows after this slide)
      – Considers resource constraints, and both intra- and inter-iteration dependences
    • Then, find a legal schedule within the range
      – Usually using a modified list scheduling in which each assignment is resource-checked through the MRT

56
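
A small sketch of the two standard bounds: ResMII is the maximum over resources of ceil(uses/units), and RecMII is the maximum over recurrence cycles of ceil(delay/distance). The use counts and the single recurrence below are illustrative values, not taken from any particular loop.

#include <stdio.h>

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

int main(void) {
    /* ResMII = max over resources of ceil(uses / units) */
    int uses[2]  = {6, 3};                  /* ops needing each resource */
    int units[2] = {2, 1};                  /* copies of each resource   */
    int resmii = 0;
    for (int r = 0; r < 2; r++) {
        int need = ceil_div(uses[r], units[r]);
        if (need > resmii) resmii = need;
    }

    /* RecMII = max over recurrences of ceil(delay / distance) */
    int delay[1] = {4}, distance[1] = {1};
    int recmii = 0;
    for (int c = 0; c < 1; c++) {
        int need = ceil_div(delay[c], distance[c]);
        if (need > recmii) recmii = need;
    }

    int minii = resmii > recmii ? resmii : recmii;
    printf("ResMII=%d RecMII=%d MII=%d\n", resmii, recmii, minii);
    return 0;
}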

Loop Scheduling (cont.)

• Modulo scheduling
  – Searching for the II
    • Basic scheme of iterative modulo scheduling:

57

minII = compute_minII();
maxII = compute_maxII();
found = false;
II = minII;
while (!found && II < maxII) {
    found = try_to_modulo_schedule(II, budget);
    II = II + 1;
}
if (!found)
    trouble();  /* wrong maxII */

Loop Scheduling (cont.)

• Modulo scheduling
  – Prologues and epilogues
    • Partial copies of the kernel
    • More complex for multiple-exit loops
    • In practice, multiple epilogues are almost always a necessity (but this is beyond our scope!)
    • Kernel-only loop scheduling
      – Condition 1: prologues and epilogues are proper subsets of the kernel code in which some operations have been disabled
      – Condition 2: a fully predicated architecture

58

Kernel-only code by predicates

Loop Scheduling (cont.)

• Modulo scheduling
  – Modulo variable expansion
    • The MRT solves correct resource scheduling for a given II
    • What about register allocation when the lifetime of a value within an iteration exceeds the II length?
      – A simple register allocation policy won’t work: the value gets overwritten!
    • Solution: artificially extend the II without performance degradation by unrolling the loop body -> modulo variable expansion
    • Must unroll by at least a factor k = ceil(v / II), where v is the length of the longest lifetime
      – e.g. if the longest lifetime is v = 5 cycles and II = 2, the kernel must be unrolled k = ceil(5/2) = 3 times

59

Loop Scheduling (cont.)

• Modulo scheduling
  – Modulo variable expansion
    • But: increased kernel code length, register pressure, …
    • Solution: rotating registers
      – Physical register instantiation: the combination of a logical identifier and a register base that is incremented at every iteration
        • A reference to register r at iteration i points to a different location than at iteration i+1
      – Makes it possible to avoid modulo variable expansion

60

61

Register r1 needs to hold the same variable in two different iterations, but the lifetimes overlap

Unroll kernel twice!

62

Two registers (r1, r11) are used to resolve the overlap. Same throughput, but code size hurts

Loop Scheduling (cont.)

• Modulo scheduling
  – Iterative modulo scheduling
    • It is sometimes hard to find a schedule due to a complex MRT
    • To improve the probability of finding a schedule, allow a controlled form of backtracking (unscheduling and rescheduling of instructions)
  – Advanced modulo scheduling techniques
    • So far, several heuristics: e.g. guessing a good minII
    • Recent techniques
      – e.g. Hypernode reduction modulo scheduling (HRMS)
        » Reduces loop-variant lifetimes while keeping the II constant

63

Loop Scheduling (cont.)

• Clustering
  – Review of the need for clustering
    • A practical solution to high register demand, rather than a multiported register file or bypassing logic
      – Multiported files are expensive and scale poorly
    • A clustered architecture divides the datapath into separate clusters
    • Each cluster has its own register bank and functional units
    • In general, explicit intercluster copy operations are needed
  – The compiler’s new role
    • Minimizing intercluster moves and balancing the clusters

64

Loop Scheduling (cont.)

• Clustering
  – Preassignment techniques
    • In general, clustering is done before scheduling
    • Two techniques
      – Bottom-up greedy (BUG)
        » Two phases: a traversal from exit to entry, and assignment
      – Partial-component clustering (PCC)
        » Reduces complexity by constructing macronodes
  – Clustering overheads
    • Two clusters: 15–20% lost cycles; four clusters: 25–30%

65

3. Register Allocation

• Register allocation
  – Memory is much larger than the register space
  – An NP-hard problem
  – This problem is old and well known
    • Standard technique: coloring of the interference graph
    • Recent: nonstandard register allocation techniques
      – Faster and better than graph coloring
      – Linear-scan allocators (a small sketch follows after this slide)
        » Of interest for JITs and dynamic translation
  – Tradeoffs between compile time and run time
    • Feasible today because of faster machines

66
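
A much-simplified sketch of the linear-scan idea: live intervals sorted by start point are walked once and mapped onto K physical registers, spilling when none is free. The intervals, K = 2, and the spill policy (leave the new interval unallocated) are illustrative simplifications; a real allocator maintains an active list and compares interval end points when choosing what to spill.

#include <stdio.h>

#define K  2
#define NV 4

typedef struct { int start, end, reg; } Interval;   /* reg = -1: spilled */

int main(void) {
    Interval v[NV] = { {0, 9, -1}, {1, 3, -1}, {2, 6, -1}, {4, 8, -1} };
    int reg_free_at[K] = {0, 0};        /* cycle at which each register frees up */

    for (int i = 0; i < NV; i++) {      /* intervals already sorted by start */
        int r = -1;
        for (int j = 0; j < K; j++)     /* find a register that is free again */
            if (reg_free_at[j] <= v[i].start) { r = j; break; }
        if (r >= 0) {
            v[i].reg = r;
            reg_free_at[r] = v[i].end;
        }                               /* else: spill v[i] (reg stays -1) */
    }

    for (int i = 0; i < NV; i++) {
        if (v[i].reg < 0)
            printf("v%d [%d,%d) -> spilled\n", i, v[i].start, v[i].end);
        else
            printf("v%d [%d,%d) -> r%d\n", i, v[i].start, v[i].end, v[i].reg);
    }
    return 0;
}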

Phase-ordering Issues

• Phase ordering is a hard problem
  – Should register allocation be done before, after, or at the same time as scheduling?
  – Register allocation and scheduling have conflicting goals
    • The register allocator tries to minimize spills and restores (creating sequential constraints through register reuse)
    • The scheduler tries to fill all parallel units
  – How to order them?
    • A very tricky problem

67

Phase-ordering Issues

• Scheduling followed by register allocation followed by post-scheduling
  – The most popular choice (common for modern RISC machines)
  – Favors ILP over efficient register utilization
    • Assumes enough registers are available
  – The post-scheduler rearranges the code

68

Scheduling: without regard for the number of physical registers actually available

Register allocation: a legal allocation for the given schedule might not exist, so spills/restores are inserted

Post-scheduling: after inserting spills/restores, fix up the schedule, making it legal with the fewest possible added cycles

Phase-ordering Issues

• Register allocation followed by scheduling
  – Favors register use over exploiting ILP
  – Works well with few GPRs (e.g. x86)
  – But the register allocator introduces additional dependences every time it reuses a register

69

Register allocation: producing code with all registers assigned

Scheduling: though not very effectively, because the register allocation has inserted many false dependences

Phase-ordering Issues

• Combined register allocation and scheduling
  – Potentially very powerful, but very complex
  – A list-scheduling algorithm may not converge
• Cooperative approaches
  – The scheduler monitors register resources and estimates pressure in its heuristics

70

Scheduling and register allocation done together: difficult engineering, and it is difficult to ensure that scheduling will ever terminate

4. Speculation and Predication

• Speculation and predication
  – Remove and transform control dependences
  – They are usually independent techniques, and often one is much more appropriate than the other
  – Note that predication is important in software pipelining

71

Control and Data Speculation

• Control and data speculation
  – Recall exception behavior and the recovery model
    • Nonexcepting parts and sentinel (checking) parts
    • From the compiler’s perspective
      – It is complicated to support nonexcepting loads because of recovery-code handling
  – Speculative code motion (or code hoisting)
    • Removes actual control dependences, unlike predication
    • The compiler needs to consider the supported exception model and speculative memory operations

72

73

Speculative code motion example

load operation becomes speculative load (load.s)

Predicated Execution

• Compiler techniques for predication
  – Examples: if-conversion, logical reduction of predicates, reverse if-conversion, and hyperblock-based scheduling
  – If-conversion (a small branch-free sketch follows after this slide)
    • Translates control dependence into data dependence
    • Converts an acyclic subset of the CFG from unpredicated code into straight-line code with predication
    • Also tries to minimize the number of predicate values
      – Logical reduction of predicates

74
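
A small C analogy for what if-conversion does: the control-dependent branch is replaced by computing a predicate and letting data selection pick the result, so both paths' work executes. A predicated ISA would instead guard each operation with the predicate; the function names here are made up for illustration.

#include <stdio.h>

int max_branchy(int a, int b) {
    if (a > b)                    /* control dependence: a branch decides */
        return a;
    return b;
}

int max_predicated(int a, int b) {
    int p = (a > b);              /* compute the predicate               */
    return p * a + (1 - p) * b;   /* both "paths" execute, p selects one */
}

int main(void) {
    printf("%d %d\n", max_branchy(3, 7), max_predicated(3, 7));
    return 0;
}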

Predicated Execution (cont.)

• Compiler techniques for predication
  – Reverse if-conversion
    • Removes predicates, returning to unpredicated code
    • It may be worthwhile to if-convert aggressively
    • When there are insufficient predicate registers, selectively reverse if-convert
  – Hyperblock-based scheduling
    • A unified framework for both speculation and predication
    • First choose a hyperblock region, then apply if-conversion
      – Gives the schedule constructor much more freedom to schedule, and removes speculative constraints

75

76

Example of predicated code (some operations are always executed)

Predicated Execution (cont.)

• Case studies in embedded systems
  – Usually not fully predicated like the IPF architecture
  – ARM includes a 4-bit predicate field in every operation
    • Looks like everything is predicated
    • But the predicate “register” is the usual set of condition code flags instead of an index into general predicate registers
  – TI C6x supports full predication
    • Five of the GPRs can be specified as condition registers

77

Prefetching

• Memory prefetching
  – A form of speculation, invisible to programs
  – Compiler-supported prefetching is better than pure hardware prefetching in many cases
  – Compiler assistance in prefetching (a small sketch follows after this slide)
    • The ISA includes a prefetch instruction
      – It is only a hint to the hardware
    • Automatic insertion requires understanding loop behavior
    • Unneeded prefetches waste resources

78
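
A small sketch of compiler-style prefetch insertion in a streaming loop, using the GCC/Clang __builtin_prefetch intrinsic to stand in for a prefetch instruction; the prefetch distance of 16 elements ahead is an illustrative choice, not a recommendation.

#include <stdio.h>

#define N 1024

int main(void) {
    static double a[N];
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        a[i] = i;

    for (int i = 0; i < N; i++) {
        if (i + 16 < N)
            __builtin_prefetch(&a[i + 16]);   /* hint only; may be ignored */
        sum += a[i];
    }
    printf("%f\n", sum);
    return 0;
}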

Other Topics

• Data layout methods
  – Increase locality by considering cache lines
• Static and hybrid branch prediction
  – Profiles are used to set static branch predictions
  – More sophisticated approaches
    • Hybrid methods: predict statically or dynamically
      – e.g. IPF includes four branch hint encodings
        • static taken, static not-taken, dynamic taken, and dynamic not-taken

79

5. Instruction Selection

• Instruction selection
  – Translates from a tree-structured, linguistically oriented IR to an operation- and machine-oriented IR
  – Especially important with complex instruction sets
  – Recent techniques
    • Cost-based pattern-matching rewriting systems
      – “Match” or “cover” the parse tree produced by the front end using a minimum-cost set of operation subtrees
    • e.g. BURS (bottom-up rewriting systems)
      – 1st pass: labels each node in the parse tree
      – 2nd pass: reads the labels and generates target machine operations

80
