
International Journal of Parallel Programming, Vol. 28, No. 2, 2000

The Design of the PROMIS Compiler: Towards Multi-Level Parallelization1

Hideki Saito,2 Nicholas J. Stavrakos,2

Constantine D. Polychronopoulos,2 and Alex Nicolau3

Received May 26, 1999; revised November 15, 1999

Most systems that are under design and likely to be built in the future will employ hierarchical organization with many levels of memory hierarchy and parallelism. In order to efficiently utilize the multiple levels of parallelism available in the target architecture, a parallelizing compiler must orchestrate the interactions of fine-grain and coarse-grain program transformations. This article describes issues of multi-grain parallelization and how they are addressed in the PROMIS compiler design. PROMIS is a multilingual, parallelizing, and retargetable compiler with an integrated frontend and backend operating on a single unified and universal intermediate representation. PROMIS exploits multiple levels of static and dynamic parallelism, ranging from task- and loop-level parallelism to instruction-level parallelism, based on a target architecture description. The frontend and the backend are integrated through a unified internal representation common to the high-level, the low-level, and the instruction-level analyses and transformations.

KEY WORDS: Compiler; loop parallelization; ILP (instruction-level parallelization); IR (internal representation); HTG (hierarchical task graph).

1. INTRODUCTION

Most systems that are under design and likely to be built in the future will employ hierarchical organization with many levels of memory hierarchy and parallelism (Fig. 1).



1 This research has been supported in part by a grant from DARPA and NSA, MDA904-96-C-1472, a grant from NSF Next Generation Software, NSF EIA 99-75019, and a grant from Intel Corporation.

2 Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign. E-mail: [saito.stavrako.cdp]@csrd.uiuc.edu.

3 Department of Information and Computer Science, University of California at Irvine. E-mail: nicolau@ics.uci.edu.


Fig. 1. A hierarchy of systems.

While these architectures are evolutionary and track advances in hardware technology, they pose new challenges in the design of parallelizing compilers. In order to efficiently utilize the multiple levels of parallelism available in the target architecture, a parallelizing compiler must orchestrate the interactions of fine-grain (or low-level) and coarse-grain (or high-level) program transformations. Such interactions include trading off parallelism between multiple levels and managing the side effects of transforming the program to optimize at a particular level.

In conventional compiler design, exploitation of high-level parallelism (HLP) and instruction-level parallelism (ILP) is performed by two separate compilers, i.e., the frontend parallelizer (frontend)(1-3) and the backend optimizer (backend).(4) In this article, we argue that this conventional design is not suitable for the next generation of system architectures, and we describe how the PROMIS compiler design addresses issues of multi-level parallelization.

PROMIS is a multilingual, parallelizing, and retargetable compiler under development at the Center for Supercomputing Research and Development (CSRD) at the University of Illinois.(5) PROMIS tackles the challenges posed by modern architectures through its hierarchical internal representation (IR), the integration of the frontend and the backend via a single unified and universal IR (or UIR), and extensive symbolic analysis.

The hierarchical IR provides a natural mapping for exploitation of multi-level memory hierarchy and parallelism.(6) The frontend-backend integration via the unified IR enables the propagation of more information from the frontend to the backend, which in turn helps achieve a synergistic effect on the performance of the generated code.(7) Symbolic analysis not only produces control-flow-sensitive information to improve the effectiveness of existing analysis and optimization techniques, but also quantitatively guides program optimizations to resolve many tradeoffs.(8)


The rest of this article is organized as follows: Section 2 studies issues of multi-level parallelization. Section 3 gives an overview of the PROMIS compiler and the PROMIS IR, and describes how its design addresses the issues discussed in Section 2. Section 4 presents our initial results from the Proof-of-Concept prototype implementation. Related work is reviewed in Section 5. Finally, Section 6 describes our conclusions and future work.

2. MULTI-LEVEL PARALLELIZATION

In order to utilize a hierarchy of parallelism in modern system architectures, a parallelizing compiler must decompose the compiled program into multiple granularities. Multi-level parallelization is not necessarily a new concept; however, the issues involved in multi-grain parallelization have yet to be extensively studied.(9-12) Furthermore, the rapid increase in both HLP and ILP in systems ranging from supercomputers to high-end PCs makes this an important topic of interest.

Suppose that the target architecture has two levels of parallelism (e.g., a superscalar-based multiprocessor), and the frontend and the backend individually yield speedups of M and N, respectively, for a given program. One might naively expect the net speedup from combining HLP and ILP to be in the neighborhood of M×N. However, performance projection of multi-level parallelization is not so trivial. The net speedup can actually vary from much higher than M×N to much lower, depending on the compiled program.(7) In many cases, such deviations can be explained by simple calculation. Table I illustrates an example where individual loops achieve multiplicative speedups, but the speedup for the entire program falls far short of being multiplicative. In this case, the net speedup is bounded by the more sequential Loop A (Amdahl's Law), while neither the ILP-only nor the HLP-only speedup comes close to the upper bound imposed by Loop A. [Note: The ILP, HLP, and Net speedups of the entire program become 20, 20, and 40, respectively, if the Loop B speedup were infinite.] The rest of the article focuses on behaviors that are not explained by this simple arithmetic, especially cases where a single loop fails to achieve a multiplicative speedup.

Table I. ILP, HLP, and Net Speedups for a Hypothetical Program

Loop      Seq. Exec. Time   ILP Speedup   HLP Speedup   Net Speedup
Loop A          100               2             2             4
Loop B          900              10            20           200
Total          1000             7.2            11            34
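The totals follow directly from summing the per-loop parallel execution times. For instance,

    Net (Total) = 1000 / (100/4 + 900/200) = 1000 / 29.5 ≈ 34

and with an infinite Loop B speedup the program totals become

    ILP = HLP = 1000 / (100/2) = 20,   Net = 1000 / (100/4) = 40,

as stated in the note above.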



Fig. 2. A code segment from TOMCATV main loop.

One of the reasons the net speedup fails to achieve the apparent multiplicative M×N is a conflict of interest between the frontend and the backend. The former tries to maximize the benefit of HLP, while the latter attempts to maximize ILP. When the parallelism available in the program is limited, there will not be enough parallelism left to achieve a good ILP speedup after HLP has been maximally exploited. That leads to a lower net speedup, lower functional unit usage, and poor system throughput.

For example, part of the main loop of TOMCATV (from SPEC FP95) is the code segment shown in Fig. 2. A parallelizing compiler (or an early phase of the backend compiler) transforms the parallel loop DO 110 into a call to a multithreading library function DOALL() and a function definition LOOP_110() (see Fig. 3). [Note: This example uses a simplified version of the IML_DOALL() API.(10)]
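Figures 2 and 3 are not reproduced in this text. Purely as an illustration of the shape of such an outlining transformation, the following C sketch shows a loop body outlined into a function and handed to a DOALL-style runtime call; the DOALL() signature, argument structure, and loop-body computation are assumptions for illustration, not the actual simplified IML_DOALL() interface or the TOMCATV code.

    /* Illustrative sketch only: assumed DOALL() signature and a placeholder
       loop body, not the actual IML_DOALL() API. */
    typedef void (*body_fn)(long lo, long hi, void *args);

    /* Assumed runtime entry point: execute body over iterations [1, n_iters]
       in parallel, handing out at least min_chunk iterations at a time. */
    extern void DOALL(body_fn body, long n_iters, long min_chunk, void *args);

    struct args110 { double *x, *y; };

    /* Outlined body of the original parallel loop (computation is a placeholder). */
    static void LOOP_110(long lo, long hi, void *p)
    {
        struct args110 *a = p;
        for (long j = lo; j <= hi; j++)
            a->x[j] = 0.5 * (a->x[j] + a->y[j]);
    }

    /* The original loop is then replaced by a single call such as:
           struct args110 a = { x, y };
           DOALL(LOOP_110, n, MIN_CHUNK, &a);                                  */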

Fig. 3. Parallelized Loop 110 in the backend.

Fig. 4. Parallelized Loop 120 in the backend.

In order to parallelize the DO 110 loop for a modern multiprocessor system, the compiler has to determine both of the following at the same time: (1) how many iterations should be distributed across multiple processors; and (2) how many iterations are needed to sustain a good IPC (instructions per cycle) rate through unrolling and software pipelining. In a dynamic execution environment (e.g., a multiprogrammed workload without gang scheduling), a runtime load-balancing scheme, such as guided self-scheduling, favors a smaller minimum chunk size (i.e., more iterations for inter-processor parallelism). On the other hand, a good IPC rate through unrolling and software pipelining needs more iterations for intra-processor parallelism, so as to amortize the lower IPC of the prologue and epilogue parts. A good trade-off must be found when these two goals are not simultaneously satisfiable.
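For reference, guided self-scheduling hands out roughly ceil(R/P) iterations per scheduling event, where R is the number of remaining iterations and P the number of processors, bounded below by the minimum chunk size discussed above. A minimal sketch of such a chunk-size rule follows; the function name and exact clamping policy are illustrative, not the IML implementation.

    /* Sketch of a guided-self-scheduling chunk-size rule (illustrative). */
    long gss_chunk(long remaining, long num_procs, long min_chunk)
    {
        long chunk = (remaining + num_procs - 1) / num_procs;  /* ceil(R/P) */
        if (chunk < min_chunk)
            chunk = min_chunk;   /* lower bound requested for ILP (unrolling, pipelining) */
        if (chunk > remaining)
            chunk = remaining;   /* never hand out more iterations than remain */
        return chunk;
    }

A larger min_chunk helps the backend amortize prologue and epilogue overhead within each chunk, while a smaller one leaves the scheduler more freedom for load balancing; this is exactly the trade-off described above.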

Another reason for the lower-than-expected performance is that program transformations conducted by the frontend interfere with the analyses and transformations performed by the backend. In general, the frontend transforms an input program into a more complex (but highly optimized) form. This transformation usually makes backend analyses and transformations less effective than they would be on the original input program. For example, a good frontend transforms the DO 120 loop nest in Fig. 2 into a call to DOALL(), as opposed to naively transforming the DO 121 loop (see Fig. 4). [Note that a naive loop interchange destroys locality of memory references.] In this case, the frontend can reasonably choose a small minimum chunk size (e.g., a few iterations). However, the backend will not try to unroll (or apply software pipelining to) the outer DO 120 loop because it lacks this knowledge.

These observations indicate that the conventional independence between the frontend and the backend does not accommodate modern system architectures very well.

3. PROMIS AND PROMIS IR

3.1. The Design of the PROMIS Compiler

The PROMIS compiler is a multilingual, parallelizing, and retargetable compiler with an integrated frontend and backend operating on a single unified and universal IR (or UIR). Unlike most other compilers, PROMIS exploits multiple levels of static and dynamic parallelism, ranging from task- and loop-level parallelism to instruction-level parallelism, based on a target architecture description.

Figure 5 shows the organization of the PROMIS compiler. The core of the compiler is the unified and universal hierarchical representation of the program. Both the frontend and the backend analysis and optimization techniques, driven by the description of the target architecture, manipulate this common UIR. Support for symbolic analysis is an integral part of the UIR, which provides control-sensitive information throughout the compilation process. PROMIS supports C, C++, FORTRAN77, and Java bytecode as input languages and can target a wide variety of systems, such as CISCs, RISCs, and DSPs.

Fig. 5. Overview of the PROMIS compiler.

The integrated frontend and backend provide several important opportunities for high-level/low-level interactions and trade-offs that are either very difficult or impossible to achieve effectively in conventional compilers. Some of the advantages of a unified and integrated approach include the following:

• Instruction-level parallelization (in the backend) can be based on information about the semantics of the source language and algorithm (when available) that is normally not available to compiler backends. This information can be used to eliminate spurious dependencies in the backend that either cannot be removed or are too expensive to remove via low-level analysis alone.

• Instruction-level parallelization (in the backend) can make use of high-level transformations (such as loop interchange, loop fusion, etc.) that have the effect of increasing the availability of ILP.

• Context-sensitive trade-offs can be made between HLP and ILP when necessary. Some transformations, such as loop interchange, can have the effect of increasing one type of parallelism at the expense of the other (e.g., outer-loop vs. inner-loop parallelism). Integration facilitates context-sensitive trade-offs by providing a framework in which different granularities of parallelism can be tried and compared against each other.

• Efficiency and effectiveness of the compiler are improved. In the conventional compiler framework, almost all program information derived through time-consuming analysis phases in the frontend is discarded when the intermediate code is passed down to the backend. Given a more complex (but optimized) form of the original program, the backend fails to re-analyze it to the same level of accuracy and thus produces non-optimal code. Our integrated framework, however, passes the analyzed information down to the backend (i.e., no re-analysis is needed) and provides the higher accuracy required for generating better code.

3.2. The Design of the PROMIS IR

In order to facilitate the integration of the frontend and the backend, the PROMIS compiler uses a common, unified internal representation, which maintains all vital program structures. For the efficient operation of the compiler, such a unified IR must be capable of representing the program at both levels. In addition, the IR of a compiler that targets hierarchical systems must provide a natural mapping of the program hierarchy onto its structure. One such hierarchical representation is the Hierarchical Task Graph (HTG).

3.2.1. The Hierarchical Task Graph

The Hierarchical Task Graph (HTG)(6) is a directed acyclic graph G = (V, E), where V is a set of nodes and E is a set of edges that represent control flow through the nodes. The set V contains the following five types of nodes:

1. A unique start node that has no incoming edges and dominates all other nodes in V.



2. A unique stop node that has no outgoing edges and post-dominates all other nodes in V.

3. Simple nodes represent tasks that have no sub-tasks.

4. Loop nodes represent loops whose loop bodies are sub-HTGs (HTGs contained within other HTGs).

5. Compound nodes represent sub-HTGs other than loops. A compound node X represents the HTG G(X) = (V(X), E(X)). Compound nodes are used for single-entry single-exit (SESE) non-loop constructs (e.g., basic blocks and subroutines).

Figure 6 illustrates the hierarchical nature of the various types of nodes. Nodes A and B are loop nodes at the top level of the hierarchy. Node C is a loop node at the next level. Node D is a compound node at the top level. Each basic block is a sub-HTG that contains a sequence of simple nodes.

Fig. 6. A hierarchical task graph from Ref. 6.


A key characteristic of the HTG is the ability of a node to summarize the information (such as data dependence) of all the nodes it contains. The node can then be treated as one atomic unit, without the need to visit each of the contained nodes.

The HTG has been successfully used both in a frontend parallelizer(1) and in a backend compiler.(13) In the HTG, hierarchical nodes capture the hierarchy of program statements, and hierarchical dependence edges represent the dependence structure between tasks at the corresponding level of the hierarchy. Therefore, parallelism can be exploited at each level of the HTG: between statements (or instructions), between blocks of statements, between blocks of blocks of statements, and so on. This flexibility promotes a natural mapping of the parallelism onto the hierarchy of the target architecture.

In the PROMIS implementation of the HTG, simple nodes are further classified as AssignStmt (e.g., x=expr), CallStmt (e.g., x=foo(arg1, arg2, ...)), and PointerStmt (e.g., *p=expr). Likewise, loop nodes are divided into two categories: GeneralLoop (generic repeat-until loops) and FixedIterLoop (FORTRAN-like DO loops without premature exits). This classification simplifies the identification of programming constructs important to program optimizations.
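A minimal sketch of how such a node taxonomy could be expressed follows, written in C for brevity; the type and field names are illustrative assumptions, not the actual PROMIS classes.

    /* Illustrative HTG node taxonomy (names are assumptions, not PROMIS's API). */
    typedef enum {
        NODE_START, NODE_STOP,
        NODE_ASSIGN_STMT,       /* simple node: x = expr              */
        NODE_CALL_STMT,         /* simple node: x = foo(arg1, ...)    */
        NODE_POINTER_STMT,      /* simple node: *p = expr             */
        NODE_GENERAL_LOOP,      /* loop node: generic repeat-until    */
        NODE_FIXED_ITER_LOOP,   /* loop node: DO loop, no early exit  */
        NODE_COMPOUND           /* SESE non-loop sub-HTG              */
    } HTGNodeKind;

    typedef struct HTGNode HTGNode;
    struct HTGNode {
        HTGNodeKind  kind;
        HTGNode    **succs;       /* outgoing control-flow edges (E)        */
        int          num_succs;
        HTGNode     *sub_htg;     /* loop body or compound sub-HTG, if any  */
        void        *dep_summary; /* summarized data-dependence information */
    };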

3.2.2. Support for Frontend-Backend Integration

Enhanced support for integrated compilation in PROMIS is enabled by the UIR. The UIR propagates vital dependence information obtained in the frontend to the backend. The backend, for example, does not need to perform memory disambiguation, since the data dependence information from the frontend replaces it with higher accuracy. Backend optimization techniques that rely on accurate memory disambiguation can therefore work more effectively in the integrated compiler.

The PROMIS IR has three distinct levels of representation: high-level (HUIR), low-level (LUIR), and instruction-level (IUIR). Although the UIR can be at any arbitrary sub-level between the HUIR and the LUIR during the course of the IR lowering process (and also between the LUIR and the IUIR), the focus of the current development effort is on the three major levels. In the PROMIS IR, statements are represented as HTG nodes.

The abstract syntax trees from the parser can have arbitrarily complex expression trees. During the construction of the HUIR, expression trees are normalized to have a single point of side effects per statement. Function calls and assignments to pointer dereferences are identified and isolated as separate statements. During IR lowering (from HUIR to LUIR), complex expression trees are broken down into collections of simple expression trees, each of which is similar to a quadruple. Data dependence information is maintained and propagated throughout the lowering process. Therefore, the PROMIS backend utilizes the same quality of dependence information as the frontend, unlike conventional compilers.

Figure 7a shows the HUIR representation of the statement a[i]=b*c. At the leaf level of the HTG, there is an AssignStmt node corresponding to this assignment statement. The associated expression tree gives the semantics of the statement. For the sake of simplicity, the left-hand side of an assignment operator is always an address. This is also true when the value is assigned to a virtual (or physical) register that technically does not have an address. Data dependence edges connect the source and the destination expressions of data dependence within this expression tree. Hierarchical data dependence edges connect the source and the destination HTG nodes, summarizing the detailed data dependence information provided by the data dependence edges. Figure 7b shows part of the LUIR corresponding to Fig. 7a. In this example, IR lowering is performed for a register-register type architecture. During the lowering process, local dependence information is generated and nonlocal dependence information is updated to reflect the lowering. Since the statements are already normalized during HUIR construction, it is straightforward to perform IR lowering while maintaining the accuracy of the dependence information.
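As a rough illustration of the quadruple-like form produced by lowering (the actual LUIR/IUIR syntax is not shown in this article), the statement a[i]=b*c might decompose as follows; the temporaries t1-t4 stand in for virtual registers.

    /* Illustrative decomposition of "a[i] = b * c" after IR lowering. */
    void lowered_example(double *a, long i, double b, double c)
    {
        double  t1 = b;        /* load b into a virtual register          */
        double  t2 = c;        /* load c                                  */
        double  t3 = t1 * t2;  /* multiply                                */
        double *t4 = &a[i];    /* the left-hand side is always an address */
        *t4 = t3;              /* the single point of side effect         */
    }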

3.3. The IML Multithreading Runtime Library

The problem of multi-level parallelization also affects the design of the multithreading runtime library interface. IML (the Illinois-Intel Multithreading Library), co-developed by CSRD and Intel Corporation, is a runtime library that provides unified support for loop and functional parallelism.(10) The IML_DOALL() API function allows an alternate loop body (LOOP_120_2 in Fig. 8). When the number of iterations for a (loop iteration) scheduling event is the same as the minimum chunk size, the runtime task scheduler uses this alternate loop body. Given the alternate loop body (and thus knowing the number of iterations of the inner loop), the backend should be able to exploit enough ILP using both loops.
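Figure 8 is not reproduced here; the following sketch only illustrates the dispatch behavior just described, in which the alternate body is used whenever a scheduling event hands out exactly the minimum chunk size. The function names and signatures are assumptions, not the actual IML interface.

    #include <stddef.h>

    /* Illustrative dispatch between a general and an alternate loop body. */
    typedef void (*body_fn)(long lo, long hi, void *args);

    static void run_chunk(long lo, long hi, long min_chunk,
                          body_fn general, body_fn alternate, void *args)
    {
        long n_iters = hi - lo + 1;
        if (n_iters == min_chunk && alternate != NULL)
            alternate(lo, hi, args);  /* trip count known: specialized body
                                         (e.g., LOOP_120_2, unrolled or
                                         software-pipelined for min_chunk) */
        else
            general(lo, hi, args);    /* general body (e.g., LOOP_120)     */
    }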

4. INITIAL RESULTS FROM THE PROOF-OF-CONCEPT IMPLEMENTATION

This section presents the initial results from our Proof-of-Concept (POC) prototype implementation (Fig. 9)(5) of the PROMIS compiler. The POC prototype compiler is based on the Parafrase-2(1) and EVE(13) compilers.




Fig. 7. High-level, low-level, and instruction-level IR.


Fig. 8. Loop 120 of Fig. 2 parallelized with IML.

The two compilers are connected through semantic retention assertions inserted into the source code generated by Parafrase-2. For each data dependence edge, Parafrase-2 inserts an assertion to specify that the two memory references (e.g., two elements of the same array) may access the same location. In other words, two memory references are treated as independent unless there is an assertion indicating otherwise. EVE makes use of this knowledge to refine its data dependence information.
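A minimal sketch of the backend-side interpretation of such assertions, assuming a simple table of asserted reference pairs, is given below; the data structures and names are illustrative, not the actual POC implementation.

    #include <stdbool.h>
    #include <string.h>

    /* One entry per semantic retention assertion, i.e., per frontend
       data dependence edge. */
    struct assertion { const char *ref1, *ref2; };

    /* Two references may access the same location only if asserted;
       otherwise they are treated as proven independent. */
    static bool may_alias(const struct assertion *tab, int n,
                          const char *r1, const char *r2)
    {
        for (int i = 0; i < n; i++) {
            bool fwd = strcmp(tab[i].ref1, r1) == 0 && strcmp(tab[i].ref2, r2) == 0;
            bool rev = strcmp(tab[i].ref1, r2) == 0 && strcmp(tab[i].ref2, r1) == 0;
            if (fwd || rev)
                return true;
        }
        return false;
    }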

Note that the POC compiler does not achieve a full integration of the frontend and the backend. For example, Parafrase-2 does not supply dependence directions to EVE. In general, it is still unknown what kind of frontend information is useful to the backend, and to what degree. Experiments on the PROMIS compiler will quantitatively provide answers to some of these important questions for future compiler designs.

Fig. 9. Proof-of-Concept Prototype.


4.1. Target Specification and Benchmark Programs

In this experiment, a VLIW/MIMD simulator is used as the target architecture. The simulator has six processors connected through a perfect shared memory (i.e., short latency, no cache, and no bus contention). Each VLIW processor is 18-wide: three each of integer ALU, floating-point ALU, shift, floating-point multiply, floating-point divide, and memory (LOAD/STORE) units. The processor has a single register file and RISC-like multi-cycle operations.

The experiments used TOMCATV, SWIM, and MGRID (all from SPEC FP95) and the twenty-three Livermore Loops. [Note: LL16 is excluded due to incorrect results.] Since most of the Livermore Loops contain one main loop, the effect of parallelization on a particular loop is easily observed. The larger benchmarks show the effect of parallelization on a more realistic mix of loops and straight-line code, although the interactions between loop-level and instruction-level parallelization are harder to analyze.

4.2. Experimental Results

Table II shows the speedups (sequential cycles / parallel cycles) of the code generated by the POC compiler. These numbers are for nine of the Livermore Loops and the three SPEC programs. The POC compiler failed to produce any loop-level parallel speedups for the other Livermore Loops: some of them are actually serial, and others are beyond the analysis power of the POC compiler.

Table II. Speedup Results on the Proof-of-Concept Compiler

                   ILP-only (M)   Loop-only (N)   Multi-grain (X)   Effectiveness (X/(M×N))
LL1                    4.88            5.85            27.39                 96%
LL7                    4.23            5.76            24.22                 99%
LL8                   11.46            4.89            45.39                 81%
LL9                    5.74            5.62            30.5                  95%
LL12                   7.46            5.89            42.43                 97%
LL14                   3.43            1.7              7.22                124%
LL18                   4.91            4.98            30.02                123%
LL21                   8.09            5.49            42.85                 96%
LL22                   5.29            5.77            24.13                 79%
23 LL Loop Average     4.44            2.50            13.55                 95%
TOMCATV                5.44            3.47            12.46                 66%
MGRID                  3.52            4.27            14.03                 93%
SWIM                   6.25            4.18            23.08                 88%


Table III. Speedup for LL14 by Individual Loops

              ILP-only   Loop-only   Multi-grain   Effectiveness
1st Loop         8.44       1.00         8.44          100%
2nd Loop         2.60       5.86        15.00           98%
3rd Loop         4.88       1.00         4.88          100%
LL14 total       3.43       1.70         7.22          124%


The first column (M) shows the instruction-level parallel speedups obtained with loop-level parallelization disabled. The second column (N) shows the loop-level parallel speedups obtained with instruction-level parallelization disabled. The third column (X) is the multi-grain speedup, exploiting both loop- and instruction-level parallelism. The last column indicates the effectiveness of multi-grain parallelization, that is, how close the multi-grain speedup X is to the product M×N of the ILP-only speedup M and the Loop-only speedup N.
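For example, for LL1 in Table II the effectiveness is 27.39 / (4.88 × 5.85) ≈ 0.96, which appears as 96% in the last column.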

These numbers exhibit a high level of effectiveness. Most of these programs achieved more than 90%, suggesting that ILP and loop parallelism can coexist without much interference. However, as more and more parallelism is exploited at both ends, there will be cases like LL8, LL22, and TOMCATV, where the parallelism is too limited to satisfy both ILP and loop parallelism. [Note: The POC compiler finds a singly nested DOALL loop in each of LL8 and LL22.] In such cases, the frontend and the backend have to work closely together to achieve optimal trade-offs.

LL14 and LL18 show effectiveness greater than 100%. These are cases where individual loops achieve near-perfect effectiveness, yet the combined effectiveness is still well above 100% (Table III). Not all loops in a program are parallelized equally: some are DOALL loops, some are not, and some have more exploitable ILP than others.

5. RELATED WORK

PROMIS is the successor of the Parafrase-2 compiler(1) and the EVE compiler.(13) The PROMIS Proof-of-Concept prototype compiler(5) is the combination of these two compilers. The POC compiler uses semantic retention assertions to propagate data dependence information from Parafrase-2 (frontend) to EVE (backend).



Experimental results on the POC compiler indicate that propagating high-level data dependence information to the backend leads to higher performance, and they underscore the significance of tradeoffs between inter-processor and intra-processor parallelism.(9) The unified PROMIS IR propagates all dependence information computed in the frontend to the backend, and static/dynamic granularity control is used to achieve better parallelism tradeoffs. Parafrase-2 was the first multilingual compiler, aimed at FORTRAN77, C, and PASCAL. PROMIS inherits this concept in a new implementation, and currently targets FORTRAN77, C, C++, and Java (source/bytecode).

Another compiler effort aiming at similar goals is the National Compiler Infrastructure.(14) The SUIF component of NCI is based on the intermediate program format called SUIF,(15) and on analysis and optimization modules that operate on SUIF. SUIF is an abstract syntax representation of the program (statement-level or instruction-level) passed from one compiler pass to another (Fig. 10). Data dependence information, which is critical to optimization, can be encoded in the form of annotations(16) that can be ignored and recomputed. The SUIF compiler modules communicate using intermediate output files. The SUIF compiler system aims at independent development of compiler modules, while the PROMIS compiler employs an integrated design approach. Trade-off issues between the frontend and the backend are harder to tackle in a disjoint development environment.

Fig. 10. A SUIF compiler organization.

The Ardent Compiler, which later became the Stardent Titan Compiler, is very similar to PROMIS. It transfers the data dependence graph from the frontend vectorizer to the register allocator and the instruction scheduler,(17) and it also exploits multi-grain parallelism.(18) Unfortunately, no extensive study was published on the effect of this dependence information transfer. Compiler technology and architecture design have evolved significantly since then, and the importance of this approach has increased dramatically.

Cho et al. propose a High-Level Information (HLI) format to transfer information from the frontend to the backend.(19) Similar to SUIF, their effort is aimed at disjoint development of the frontend and the backend. The HLI framework has been implemented on the SUIF frontend and the GCC backend, and a new frontend is under development. HLI encodes information on line numbers, scopes, equivalent memory accesses, aliases, and loop-carried data dependences. Using the SUIF frontend, HLI can eliminate up to 50% of the dependence edges computed by GCC.(19) The result is not surprising, since dependence analysis is known to be much weaker in GCC than in state-of-the-art parallelizers. With respect to application performance improvement, our POC compiler shows better results, indicating that high-performance backends benefit more from accurate dependence information than GCC does. [Note: Since these works involve different frontends and backends, this comparison is not completely fair. However, since EVE is based on GCC and the dependence structures of the innermost loops are rather simple, the improvement in the accuracy of data dependence information should be comparable. Therefore, we claim the comparison is meaningful.] Their work specifically targets data dependence information, while our research investigates many aspects of integration.

6. CONCLUSIONS

As computer systems adopt more complex architectures with multiple levels of parallelism and deep memory hierarchies, code generation and optimization become an even more challenging problem. With the proliferation of parallel architectures, automatic or user-guided parallelization becomes relevant for systems ranging from high-end PCs to supercomputers. In this article, we presented the issues involved in multi-grain parallelization and how the PROMIS compiler design addresses them.

The PROMIS compiler encompasses automatic parallelization and optimization at all granularity levels, and in particular at the loop and instruction levels. In this paper, we argued against the conventional independence of the compiler's frontend and backend, and proposed a full integration of the two. Based on the preliminary results obtained from the Proof-of-Concept prototype, we believe that our approach of fully integrating the frontend and the backend through a common IR, together with aggressive pointer and symbolic analysis, will yield significant performance improvements over those achieved by separate parallelizers and ILP code generators using equally powerful algorithms.

Choosing an appropriate internal representation for a compiler is a critical first design step. The HTG summarizes control flow, data dependence, and other information for groups of nodes, and therefore improves the efficiency of many analyses and transformations. The hierarchy of the HTG is constructed from the structure of the program, which provides a natural starting point for exploiting multiple levels of parallelism in the program.


ACKNOWLEDGMENTS

The authors are grateful to Carrie Brownhill and Steve Novack of the University of California at Irvine; Section 4 is largely based on their work on the Proof-of-Concept prototype. We would also like to thank the PROMIS compiler associates Prof. Nikhil Dutt, Peter Grun, Ashok Halambi, and Nick Savoiu, also of the University of California at Irvine, and Jeff Brokish, Steven Carroll, Fred Jacobs, Peter Kalogiannis, Walden Ko, Chris Koopmans, and Kazushi Marukawa of CSRD, for their contributions to PROMIS.

REFERENCES

1. Constantine D. Polychronopoulos, Milind Girkar, Mohammad Reza Haghighat, Chia Ling Lee, Bruce Leung, and Dale Schouten, Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessors, Int'l. J. High Speed Computing 1(1):45-72 (1989).

2. Bill Blume, Rudolf Eigenmann, Keith Faigen, John Grout, Jay Hoeflinger, David Padua, Paul Petersen, Bill Pottenger, Lawrence Rauchwerger, Peng Tu, and Stephen Weatherford, Polaris: The Next Generation in Parallelizing Compilers, Technical Report 1375, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign (1994).

3. Kuck and Associates, Inc., Kuck and Associates, Inc. Home Page, http://www.kai.com.

4. Hideki Saito, On the Design of High-Performance Compilers, Technical Report, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Ph.D. Thesis Proposal (May 1999).

5. Hideki Saito, Nicholas Stavrakos, Steve Carroll, Constantine Polychronopoulos, and Alex Nicolau, The Design of the PROMIS Compiler, Proc. Int'l. Conf. Compiler Construction (CC) (March 1999).

6. Milind Girkar and Constantine D. Polychronopoulos, The Hierarchical Task Graph as a Universal Intermediate Representation, IJPP 22(5):519-551 (1994).

7. Carrie Brownhill, Alex Nicolau, Steve Novack, and Constantine Polychronopoulos, The PROMIS Compiler Prototype, Proc. Int'l. Conf. Parallel Architectures and Compilation Techniques (PACT) (1997).

8. Mohammad R. Haghighat, Symbolic Analysis for Parallelizing Compilers, Kluwer Academic Publishers (1995).

9. Carrie Brownhill, Alex Nicolau, Steve Novack, and Constantine Polychronopoulos, Achieving Multi-Level Parallelization, Proc. Int'l. Symp. High Performance Computing (ISHPC) (1997).

10. Hideki Saito, Nicholas Stavrakos, and Constantine Polychronopoulos, Multithreading Runtime Support for Loop and Functional Parallelism, Proc. Int'l. Symp. High Performance Computing (ISHPC) (May 1999).

11. Xavier Martorell, Eduard Ayguadé, Nacho Navarro, Julita Corbalán, Marc González, and Jesús Labarta, Thread Fork/Join Techniques for Multi-Level Parallelism Exploitation in NUMA Multiprocessors, Proc. Int'l. Conf. Supercomputing (1999).

12. Donald Yeung, The Scalability of Multigrain Systems, Proc. Int'l. Conf. Supercomputing (1999).


13. Steve Novack, The EVE Mutation Scheduling Compiler: Adaptive Code Generation for Advanced Microprocessors, Ph.D. thesis, University of California at Irvine (1997).

14. The National Compiler Infrastructure Project, http://www-suif.stanford.edu/suif/NCI (January 1998). [Also at http://www.cs.virginia.edu/nci.]

15. Mary W. Hall, Jennifer M. Anderson, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, Edouard Bugnion, and Monica S. Lam, Maximizing Multiprocessor Performance with the SUIF Compiler, IEEE Computer (December 1996).

16. Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy, SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers, Technical Report, Computer Systems Laboratory, Stanford University [http://suif.stanford.edu/suif/suif1/suif-overview/suif.html].

17. Randy Allen, Unifying Vectorization, Parallelization, and Optimization: The ArdentCompiler, Proc. Int'l. Conf. Supercomputing (1988).

18. Randy Allen, Exploiting Multiple Granularities of Parallelism in a Compiler, Proc. COMPCON, pp. 634-640 (1990).

19. Sangyeun Cho, Jenn-Yuan Tsai, Yonghong Song, Bixia Zheng, Stephan J. Schwinn, Xin Wang, Qing Zhao, Zhiyuan Li, David J. Lilja, and Pen-Chung Yew, High-Level Information: An Approach for Integrating Front-End and Back-End Compilers, Proc. Int'l. Conf. Parallel Processing (ICPP) (1998).

