
OCCAM-light - A Multiparadigm Programming Language for Transputer Networks (revised version)

Alf Wachsmann (1), Friedrich Wichmann (2)

(1) Universität-GH Paderborn, Heinz Nixdorf Institut und Fachbereich Mathematik-Informatik, D-33095 Paderborn
(2) Universität-GH Paderborn, Fachbereich Mathematik-Informatik, D-33095 Paderborn

1 Introduction

Two different paradigms for parallel computers are currently in use. In the first one, called parallel computer networks or networks for short, processors communicate by neighbour-to-neighbour message passing via a network, typically working asynchronously. The second one connects the processors with a shared memory so that information is exchanged via access to "public variables". This machine type, called PRAM as a theoretical model, works synchronously and has the great advantage of easy programming. On the other hand, current technology does not allow a direct realization of (large) PRAMs in hardware. All real machines with shared memory use some kind of simulation. Both machine types are reflected in programming languages with communication facilities of either the message passing or the shared memory type.

There have been several approaches that tried to support a pure PRAM programming style on networks, e.g. in [Sei90] or in the language "pm2" [Juv92]. A general-purpose data-parallel language with a logically shared data space, called "Modula-2*", was proposed in [PT91]. Another idea is BSP-Occam [All91], which supplements occam 2 with a library of procedures for the "bulk-synchronous parallel computers" suggested in [Val89]. This concept does not integrate the shared memory features into occam 2; moreover, it prohibits the standard use of channel communication between processors.

Our goal is to combine the two paradigms, networks and PRAMs, into a comfortable programming language for Transputer networks, a representative of the network model, without too much loss of efficiency.
To do this, we need constructs for network programming as well as features for the use of public variables, synchronization, and input/output. Our approach is based on the Transputer language occam 2 [INM88], which gives us all the network capabilities. To this basis, we add the features described above by implementing a probabilistic shared memory simulation which uses universal hashing for the distribution of the public variables, and routing for their accesses. We call our new language OCCAM-light because programming becomes less difficult compared with occam 2. OCCAM-light has been implemented and is being tested to gain experience of its usability and efficiency.

In the following chapter, we describe the ideas of OCCAM-light. The OCCAM-light compiler and the implementation of the "synchronization and shared memory library" are dealt with in Chapters 3 and 4. Some examples and results of experiments are given in Chapter 5. In the last chapter, we outline our future plans for this project.

2 Principles of OCCAM-light

The goal in the development of OCCAM-light is to support both the message passing and the shared memory paradigm on multi-processor networks. The following reasons led us to this decision. In many cases parallel algorithms for multi-processor networks need local and global communication. The message passing language occam 2 restricts the use of messages to the exchange of information between neighbours in the available network topology. Although this topology may be configurable for Transputer systems, it still depends on hardware restrictions. If possible, a programmer will choose the topology following the algorithm's local communication structure, but then it is still necessary to write routing routines for global communication or for the case of a changing communication structure. In a synchronous message passing model, buffering of data must be coded by the programmer, too. This shows that message passing in languages like occam 2 is an efficient but low-level communication paradigm, which is rather difficult to use.

On the other hand, the theoretical PRAM model has found widespread use in parallel algorithm design. The shared memory communication paradigm hides all details of communication such as network topology, routing, or hardware features. There must be support for synchronization, too, to allow a high-level programming style with shared memory communication. But this model is not realizable for large numbers of processors. On existing multi-processor networks, efficient parallel programs must make use of fast neighbour communication, which cannot be expressed in the shared memory paradigm. Thus, if one wants to allow efficient programming on distributed machines as well as the use of the more comfortable PRAM model, both paradigms should be supported.

Like global communication, several related operations are not supported directly in pure message passing languages.
This leads to a situation where every programmer writes the code for simulating these operations anew. Among these frequently needed operations are remote access to shared information, routing, synchronization of subsets of processes, and input/output operations on files from every node.

In OCCAM-light we provide the programmer with both means of communication: shared memory variables and point-to-point message passing. In addition, a set of library procedures for synchronization and input/output operations can be used by every processor node in the network.

We have chosen the language occam 2 as the basis because it is in widespread use for multi-processor networks using Transputers and supports parallel programming with point-to-point message passing. A compiler translates OCCAM-light into occam 2 programs with calls of procedures in a "synchronization and shared memory library".

The "public variables" of the virtual shared memory are mapped to processors by a pseudo-random hashing scheme to achieve a uniform distribution. If one is interested in PRAM style programming, the use of shared memory in OCCAM-light provides an easy solution.

The new features of the programming language OCCAM-light are further explained in the following sections.
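The pseudo-random mapping of public variables to processors can be illustrated with a small sketch. It is written in Python rather than occam 2, and it assumes the classical linear family h(x) = ((a*x + b) mod p) mod n as the universal hash class; the text only states that universal hashing is used, so the concrete family, the prime, the seed, and the names make_hash and owner are illustrative assumptions, not the library's actual code.

```python
import random

# Assumption: a Mersenne prime larger than any object number used below.
PRIME = 2_147_483_647

def make_hash(num_processors, rng):
    """Draw one member h(x) = ((a*x + b) mod PRIME) mod num_processors."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)

    def h(object_id):
        return ((a * object_id + b) % PRIME) % num_processors

    return h

# A fixed seed stands in for whatever lets all nodes agree on one function.
rng = random.Random(1)
owner = make_hash(8, rng)

# Count how many of 1000 hypothetical public objects land on each processor.
load = [0] * 8
for object_id in range(1000):
    load[owner(object_id)] += 1
print(load)
```

Every node that draws the same a and b computes the same owner for an object number, so no directory of variable locations is needed; the price is that a neighbouring object may live on a distant processor.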

2.1 Public Variables

To use the shared memory extension of OCCAM-light we introduce public variables. They can be accessed by every process in the network and are declared by adding the prefix PUBLIC to the type declaration. Arrays of public variables are possible, too. Channel types, the type TIMER, and more than one occurrence of PUBLIC are not allowed in a public declaration. Fig. 1 shows the syntax of type declarations in OCCAM-light.

    type           ::= primitive.type | array.type | public.type .
    primitive.type ::= 'CHAN' 'OF' protocol | 'TIMER' | 'BOOL' | 'BYTE' | 'INT'
                     | 'INT16' | 'INT32' | 'INT64' | 'REAL' | 'REAL32' | 'REAL64' .
    array.type     ::= table.list type .
    public.type    ::= 'PUBLIC' type .

Figure 1: Syntax of type declarations

Public variables can be read and written by ordinary assignment statements within their scope, which is determined in the same way as in occam 2. There are two types of public variables. Global public variables are accessible by all processes and processors of the network. They must be declared globally, outside of the node procedure containing the code for one physical processor (see Section 2.4), and will be distributed over the network by the shared memory simulation. Local public variables, declared locally inside a procedure, are mapped to the processor where they are defined and can be used for communication between parallel processes in one node of the network.

    PUBLIC REAL64 r:               -- a public float variable
    [7] PUBLIC INT elems:          -- array of seven public variables
    PUBLIC [5] INT row:            -- five INTs stored in one public variable
    [10] PUBLIC [5] INT tenrows:   -- an integer matrix with distributed rows
    PAR
      r := 0                       -- r is set to 0 or 1, depending on
      r := 1                       -- the relative speed of processes
    row := tenrows[3]              -- whole public variables are read and written
    tenrows[3][1] := elems[0]      -- parts of public variables are accessed

Figure 2: Examples for public variables

The examples in Fig. 2 show the declaration and use of public variables, which closely resemble those of ordinary occam 2 variables.
For structured public variables the definition specifies the granularity of their distribution in the shared memory: in declarations of public arrays, the definition part after the keyword PUBLIC defines units of the public variable that are mapped to the same processor. Thus, the programmer can, for example, determine whether single elements or rows of a matrix are stored as variables in the network or whether the matrix as a whole is one public variable. This decision should be made with respect to typical access patterns for the variables. Small distribution units increase parallelism by distribution of data, while access to larger data units yields less overhead in routing and message passing operations.

Read and write access to parts of the distribution units is possible if corresponding index expressions are added. Of course, access to entire arrays of public components is not allowed, i.e. index expressions for dimensions preceding PUBLIC must be given.

A process reading from or updating a public variable terminates when the operation is successfully executed and the result or the acknowledgement is returned. The order of operations

accessing public variables in parallel processes is not defined. This is similar to an "arbitrary" write conflict resolution rule in PRAM programs using the concurrent read concurrent write mode.

While occam 2 procedures can be used without restrictions, occam 2 functions may not have any side effects. Thus, public variables and other OCCAM-light features cannot be used inside functions either. There are some minor restrictions on the use of public variables in the current compiler implementation. For example, public variables may not be used in multiple assignments or actual parameter lists. This can be avoided by assigning public variables to local temporary variables first.

2.2 Grouping Concept and Synchronization

Many PRAM algorithms use a grouping concept, which means that subgroups of processors independently compute parts of the result, without communication between them. These algorithms typically use barrier synchronization (see e.g. [And91]) inside the subgroups of processors. In this kind of synchronization, (subgroups of) processors reach a point in their program called a "synchronization point" or "barrier", which all other processors (of the subgroup) have to reach before the computation resumes.

There are many different ways to express grouping in occam 2, where we deal with processes rather than PRAM processors. Processes have to be statically declared by PAR statements, which may be used in an enumerating form or with replication by a constant factor. The processes declared in this way can be grouped by program structure in the enumerating form, by conditions involving the replication variable in the second form, or generally by conditional statements possibly depending on input values. OCCAM-light allows all these grouping styles of occam 2.

The barrier synchronization paradigm can be expressed in two different forms in our language. Both synchronization primitives terminate when all involved processes have reached that program step.
The first allows synchronization of a known number of processes; the second is able to calculate the number of processes itself, but is restricted to processes hierarchically decomposed into groups. Their syntax is described in Fig. 3.

    process         ::= synchronization .
    synchronization ::= 'SYNC' '(' sync.id ',' sync.cnt ')'
                      | 'SYNC' '(' sync.id.old ',' sync.id.new ',' sync.cnt ')' .
    sync.id         ::= actual .
    sync.id.old     ::= actual .
    sync.id.new     ::= actual .
    sync.cnt        ::= actual .

Figure 3: Syntax of synchronization statements

In the more efficient first variant, sync.id identifies the process group meeting at this barrier by a positive integer expression, conveniently defined by a constant abbreviation, and sync.cnt denotes the number of processes involved by a non-negative integer expression. The synchronization statement terminates when sync.cnt processes have arrived.
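The behaviour of the first variant, SYNC(sync.id, sync.cnt), can be modelled with a short Python sketch. It uses threads in one address space and the standard threading.Barrier, whereas the OCCAM-light library realizes the barrier with messages between Transputers; the class SyncTable and its method names are hypothetical, and the sketch assumes sync.cnt is at least 1.

```python
import threading

class SyncTable:
    """Barriers keyed by sync.id; each one releases after sync.cnt arrivals."""

    def __init__(self):
        self._lock = threading.Lock()
        self._barriers = {}

    def sync(self, sync_id, sync_cnt):
        # The first process arriving for an identifier creates its barrier.
        with self._lock:
            barrier = self._barriers.get(sync_id)
            if barrier is None:
                barrier = threading.Barrier(sync_cnt)
                self._barriers[sync_id] = barrier
        barrier.wait()  # blocks until sync_cnt processes have arrived

table = SyncTable()
arrived = []   # filled before the barrier
after = []     # filled after the barrier

def worker(my_nr):
    arrived.append(my_nr)
    table.sync(1, 4)                 # group 1 consists of four processes
    after.append(len(arrived))       # all four arrivals are visible here

threads = [threading.Thread(target=worker, args=(nr,)) for nr in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(after)
```

Because every worker records the number of arrivals only after the barrier, each recorded value equals the group size, which is exactly the guarantee the text gives for SYNC.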

    -- PRAM algorithm for calculating the minimum by binary tree method
    -- An illustration of this algorithm can be found in Fig. 5
    VAL INT N IS 2*P:                  -- number of inputs
    PUBLIC [N] INT A:                  -- array of input numbers
    PROC Main.User.Proc (VAL INT no.proc, dim, my.nr,
                         [4] CHAN OF INT::[]INT in, out)
      SEQ
        init(my.nr)                    -- initialization of the input array
        SEQ j = 0 FOR log(N)
          SEQ
            SYNC(((my.nr / (1<<j)) + (8<<j)), (1<<j))   -- sync groups of processors
            -- offset of 8<<j is necessary to avoid any trouble due to reuse of names
            IF
              (my.nr REM (1<<j)) = 0   -- left processes of pairs
                INT tmp1, tmp2:
                SEQ                    -- compare with right process of pair
                  PAR
                    tmp1 := A[2*my.nr]
                    tmp2 := A[(2*my.nr) + (1<<j)]
                  IF
                    tmp1 > tmp2
                      PAR              -- swap numbers
                        A[2*my.nr] := tmp2
                        A[(2*my.nr) + (1<<j)] := tmp1
                    TRUE
                      SKIP
              TRUE
                SKIP
    :                                  -- minimum is in A[0]

Figure 4: Example for barrier synchronization with grouping

[Figure 5 graphic: inputs A[0] to A[7], processors P0 to P3, and the comparisons and groups of synchronizing processors in rounds j = 0, 1, 2]

Figure 5: Illustration of the minimum algorithm
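The rounds of the minimum algorithm in Fig. 4 can be replayed sequentially. The following Python sketch is an illustration only: the function names are hypothetical, the processors of one round are visited in a loop instead of running in parallel (which is harmless here because the compared pairs of one round are disjoint), and the barrier itself is omitted.

```python
def sync_args(my_nr, j):
    """Arguments of SYNC in round j of Fig. 4: (group identifier, group size)."""
    return (my_nr // (1 << j)) + (8 << j), 1 << j

def tree_minimum(values, num_procs):
    """Replay the rounds of Fig. 4 for N = 2 * num_procs inputs."""
    a = list(values)
    n = 2 * num_procs
    assert len(a) == n and (n & (n - 1)) == 0, "N must be a power of two"
    rounds = n.bit_length() - 1            # log2(N)
    for j in range(rounds):
        step = 1 << j
        for my_nr in range(num_procs):     # parallel processes in the original
            if my_nr % step == 0:          # left processes of pairs
                left, right = 2 * my_nr, 2 * my_nr + step
                if a[left] > a[right]:     # swap so the smaller value moves left
                    a[left], a[right] = a[right], a[left]
    return a

result = tree_minimum([5, 3, 8, 1, 9, 2, 7, 4], num_procs=4)
print(result[0])          # prints 1, the minimum of the inputs
print(sync_args(3, 1))    # prints (17, 2): P3 is in the second pair of round 1
```

With four processors the group numbers my.nr / (1<<j) stay below 8, so the offset 8<<j keeps the identifiers of different rounds apart, as the comment in Fig. 4 requires.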

The second synchronization variant can be used for hierarchically composed process groups only. The number of processes is initially known and assigned to an INT variable sync.cnt. The processes subdivide themselves into subgroups by one of the grouping styles mentioned above. When one of them arrives at the synchronization barrier, it declares its old and new group membership in sync.id.old and sync.id.new. The synchronization statement terminates when all processes of the old group have declared their new membership, and it returns the new number of processes in the variable sync.cnt, which can be used for further synchronizations.

Fig. 4 shows the use of synchronization; another example is given in Chapter 5. The synchronization statements can be used in any context where processes are allowed in occam 2, but not in functions. The OCCAM-light notation for synchronization may be improved in future to ease the application of barrier synchronization in typical programming examples.

2.3 Input/Output Extensions

The programming language occam 2 offers no means for parallel input from or output to the host machine from arbitrary network nodes. The user has to transfer the data explicitly to the Transputer connected with the host. OCCAM-light provides a set of input/output operations accessible from any process of the user program, which use the global communication features of the routing library. Their syntax is similar to procedure calls of the library routines normally used for input/output on the Transputer connected with the host and is shown in Fig.
6.

    process        ::= extension .
    extension      ::= extension.name '(' actuals ')' .
    extension.name ::= 'PRINT.STR' | 'OPEN' | 'CLOSE'
                     | 'FREAD' | 'FWRITE' | 'GETS' | 'PUTS'
                     | 'FLUSH' | 'SEEK' | 'TELL' | 'EOF' | 'FERROR' .

Figure 6: Syntax of input/output operations

The statement PRINT.STR(string) can be used to send output strings to the host machine. For output of other data, conversion functions of a library like convert.lib in the INMOS toolset can be used. Other OCCAM-light statements allow the use of a set of library functions for manipulation of files. Their meaning is documented in Fig. 7. The type and definition of their parameters are in our case derived from the input/output library hostio.lib and can also be seen in Fig. 7.

The OCCAM-light extensions for input/output operations can be used in any place where a process is allowed in occam 2, except in functions. The execution order of input/output operations in parallel processes is not defined. The input/output operations terminate when the operation has been executed and the result has been returned to the calling process, except for PRINT.STR, which continues immediately after the data have been sent to the network.

2.4 Channel Communication and the Program Framework

An OCCAM-light user program may use channel communication between processes on the network nodes without any restriction. The compiler guarantees that channel communication

does not interfere with the messages for the library procedures executed in parallel to the user process.

    PRINT.STR (VAL []BYTE text)
        Print string text.
    OPEN (VAL []BYTE name, VAL BYTE type, mode, INT32 streamid, BYTE result)
        Open binary or text file name depending on type and mode, yields streamid.
    CLOSE (VAL INT32 streamid, BYTE result)
        Close file denoted by streamid.
    FREAD (VAL INT32 streamid, INT length, []BYTE data)
        Read length bytes of data from binary file streamid.
    FWRITE (VAL INT32 streamid, VAL []BYTE data, INT length)
        Write length bytes of data to binary file streamid.
    GETS (VAL INT32 streamid, INT length, []BYTE data, BYTE result)
        Read string data from a text file streamid.
    PUTS (VAL INT32 streamid, VAL []BYTE data, BYTE result)
        Write string data to a text file streamid.
    FLUSH (VAL INT32 streamid, BYTE result)
        Synchronize buffers.
    SEEK (VAL INT32 streamid, offset, origin, BYTE result)
        Set position in file streamid.
    TELL (VAL INT32 streamid, INT32 position, BYTE result)
        Get current position of file streamid.
    EOF (VAL INT32 streamid, BYTE result)
        Check for end of file in file streamid.
    FERROR (VAL INT32 streamid, INT32 err.no, INT length, []BYTE msg, BYTE result)
        Yields error message string msg for file streamid.

Figure 7: Actual parameters for input/output extensions

It is necessary to define how user processes will be distributed on the network nodes. In occam 2, this is declared by placement commands, which differ between the occam language description [INM88] and occam implementations, e.g. [INM91]. In the prototype implementation, we simplified the user's task as far as possible and dealt with the configuration task in predefined files for the "shared memory and synchronization library" only.

Thus, in OCCAM-light the user just provides a procedure Main.User.Proc which represents the code of one processor node in the network, as shown in Fig. 8. This is common style in occam 2 programs anyway. The user's procedure is executed in parallel to the processes of the "shared memory and synchronization library".
For communication with the four neighbour nodes, the formal channel parameters of Main.User.Proc can be used with the channel protocol INT::[]INT, where the first integer value defines the number of integers that follow. As explained in Section 4.1, send operations to other Transputer nodes using these channels are not synchronizing as in occam 2, but are buffered by the library procedures.

    PROC Main.User.Proc (VAL INT network.size, dimension, processor.number,
                         [4] CHAN OF INT::[]INT in.chans, out.chans)
      -- user program
    :

Figure 8: Framework for user programs in OCCAM-light

The routing library multiplexes user messages with messages of the "shared memory and synchronization library" if these channels are used.

Our prototype of the routing library uses a butterfly network for the shared memory simulation. Up to now, user programs are restricted to this topology. A general routing library which can deal with arbitrary network configurations of node degree four is being implemented.

3 The Compiler for OCCAM-light

This chapter describes the main tasks of the prototype OCCAM-light compiler, which has been constructed to transform OCCAM-light programs into occam 2 programs with calls to library procedures implementing the language features on a Transputer system. The "synchronization and shared memory library" itself is presented in Chapter 4. The first section of this chapter describes the compiler parts for analysis and output, which were mostly generated automatically by the compiler generation toolset Eli [GHL+92]. The next three sections describe the translation of the program framework, of public variable access, and of synchronization statements and input/output extensions, which are the central tasks of the compiler.

3.1 Analysis and Output

As the OCCAM-light notation is an extension of the language occam 2, the compiler was developed in two steps. First, an idempotent compiler was constructed, which translates occam 2 into identical programs like a source code beautifier. Then, this compiler was modified and extended to deal with OCCAM-light features and their translation. In fact, the Eli specifications of the compiler were modified rather than the compiler itself.

The only work to be done for the first step was to implement an occam 2 scanner module and to specify an LALR(1) grammar for occam 2 that is suitable for the compiler's transformation task.
Then, an idempotent compiler can be generated automatically by Eli tools.

For many common programming languages the scanner specification in Eli is just a few lines of text long, because keywords are extracted from the grammar specification automatically and regular expressions for names and literals are predefined. But the unusual occam 2 language design needs more work. The scanner frame program, which normally skips white space, had to be rewritten to analyze indentation and to deal with continuation lines correctly.

For the first point, changes in the level of indentation are tracked and transformed into IND, EXD, and CR tokens, which represent indentation by two spaces, exdentation by two spaces, and end-of-line in the grammar. For the second point, occam 2 decides whether a line continues the previous one based on the last token of that previous line: a line continues if this token cannot end a line. Thus, if it was an operator, a comma, a semicolon, an assignment sign ":=", or one of the keywords IS, FROM, or FOR, a continuation is detected, no CR token is generated, and changes in indentation are ignored for the continuation line.

The transformation of an occam 2 grammar (e.g. from [INM88]) into LALR(1) form was not easy in all cases. For example, the distinction between a variable declaration and an expression with a type conversion written "T expr" in an option of CASE depends on the trailing colon.
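The indentation and continuation handling of the scanner described in this section can be sketched as follows. The sketch is in Python, not in the C frame program used with Eli; it treats the text of a line as a single token, ignores comments, and checks only a small illustrative subset of operators, so the names layout_tokens, CONTINUE_WORDS, and CONTINUE_ENDINGS as well as the exact placement of CR relative to IND and EXD are assumptions.

```python
CONTINUE_WORDS = ("IS", "FROM", "FOR")
# Illustrative subset: comma, semicolon, assignment sign, a few operators.
CONTINUE_ENDINGS = (",", ";", ":=", "+", "-", "*", "/")

def layout_tokens(lines):
    """Turn indentation (two spaces per level) into IND, EXD and CR tokens."""
    tokens, level, continued = [], 0, False
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if not continued:
            # Only lines that start a new statement may change the level.
            new_level = (len(line) - len(line.lstrip(" "))) // 2
            while level < new_level:
                tokens.append("IND")
                level += 1
            while level > new_level:
                tokens.append("EXD")
                level -= 1
        tokens.append(text)
        last_word = text.split()[-1]
        continued = last_word in CONTINUE_WORDS or text.endswith(CONTINUE_ENDINGS)
        if not continued:
            tokens.append("CR")   # the line really ends here
    tokens.extend(["EXD"] * level)
    return tokens

source = ["SEQ", "  x := 1 +", "      2", "  y := x"]
print(layout_tokens(source))
```

In the example the second line ends with an operator, so no CR is produced and the deeper indentation of the third line is ignored, which matches the continuation rule stated above.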

To avoid merging of expressions and declarations in the grammar, the compiler prototype restricts occam 2 for type conversions in CASE options and demands parentheses around this kind of type conversion.

With these tasks done, an idempotent compiler could be generated automatically. The result consists of an attribute grammar part and a PTG specification. For the use of attribute grammars in Eli for semantic analysis and synthesis see [Kas91]. PTG means "program text generator"; it is another Eli tool, also explained in [Kas91], which provides an easily usable implementation of tree-structured textual output. The attribute grammar contains a call of a PTG-generated function for every grammar rule, thus reproducing the input text. This default output attribution was changed only in those grammar rules where the OCCAM-light transformations are inserted.

To perform the OCCAM-light transformations, some semantic analysis has to be done first. This has been implemented by attribute grammar specifications using the Eli module library for standard tasks and some supporting C modules to hold the analysis results.

For name analysis there is an Eli module which provides all attributions needed for nested scopes with different rules for uniqueness and hiding in different languages. The module assigns unique keys to identifiers referring to the same object [KW91]. In occam 2 there must be at most one definition for an identifier in a scope. However, a new scope starts with each definition. The kind of an identifier (variable, abbreviation, procedure, or replicator variable) is determined by the context of its definition and stored in the definition table.

Further analysis is done for type declarations, because public types must be recognized and the size of public variables must be calculated. For public arrays the number and range of dimensions are saved in a type definition module, too. In the declaration of array boundaries constant expressions can be used.
Thus, another module contains attributions to calculate values of known expressions and abbreviations.

The last analysis module collects some additional information for the transformations. This includes tasks like tagging statements and procedures that contain OCCAM-light language constructs and counting public variable occurrences on both sides of assignment statements. Afterwards, the translation can be started.

3.2 Translation of the Program Framework

The overall structure of an OCCAM-light program and the occam 2 parts are not changed by the compiler, but as mentioned in Section 2.4 the user's node procedure is started in parallel to the "synchronization and shared memory library" processes. Thus, the compiler transforms the OCCAM-light program as shown in Fig. 9.

The generated node procedure starts with a definition of constants and a public variable map containing statically computable information for the library procedures, e.g. the size of the memory needed for all public variables. The map gives the starting offset in this memory for each public variable and is indexed by public variable object numbers. Then, the library itself is included, followed by the transformed user procedures including Main.User.Proc. The latter is invoked by the body of the node procedure in a parallel statement together with the library procedure. In the real compiler output there is an additional

library call for termination. Variables defined by the compiler which could be in conflict with user-defined identifiers are given a prefix "X..".

    PROC Node.Proc (VAL INT X..dim, ... params ..., CHAN OF ANY ... links ...)
      VAL INT MAXOBJID IS pubvarcnt: ...        -- and other constants for the library
      VAL [MAXOBJID] INT PUBVARMAP IS [...]:    -- offsets of public variables
      #INCLUDE "Route.Random.Rank.occ"          -- library include
      ... (possibly) transformed user procedures ...
      PROC Main.User.Proc (VAL INT network.size, dimension, processor.number,
                           [4] CHAN OF INT::[]INT in.chans, out.chans)
        ... transformed user program ...
      :
      PAR
        Main.User.Proc (X..N, X..dim, X..my.nr, X..IN, X..OUT)
        Route.Random.Rank (X..N, X..dim, X..my.nr, X..IN, X..OUT)
    :

Figure 9: Generated program frame

In the prototype compiler there is, up to now, one library named Route.Random.Rank. However, the compiler is prepared to change this name and its include path by command-line options. Other options are provided to modify the number of processors assumed, the hash function used to map public variables to processors, and the size of communication buffers. Thus, flexibility for experiments with other implementations and parameters is gained. There are also command-line switches to control the insertion of information about the variable mapping in the output and the generation of error listings.

3.3 Translation of Public Variable Access

The translation of public variable access is the central part of the compilation. Assignments with read or write access to public variables are supplemented by library procedure calls to read or write the data from or to the shared memory, based on the information collected in the analysis phase.
The expressions to be assigned and the variable on the left side of such an assignment have to be changed as well.

    Source:
        [u1][...] PUBLIC [up][...] T x:
        x[expr1][...][exprp][...] := expr

    Translation if s mod 4 = 0:
        VAL [s'] INT X..Wn RETYPES expr:
        SEQ
          X..WRITE (objid + (elem(expr1, ...)), offs(exprp, ...), s,
                    X..Wn, X..TO.RT[chnum], X..FROM.RT[chnum])

    Translation if s mod 4 ≠ 0:
        VAL [s] BYTE X..En RETYPES expr:
        [4s'] BYTE X..Bn:
        SEQ
          [X..Bn FROM 0 FOR s] := X..En
          VAL [s'] INT X..Wn RETYPES X..Bn:
          X..WRITE (objid + (elem(expr1, ...)), offs(exprp, ...), s,
                    X..Wn, X..TO.RT[chnum], X..FROM.RT[chnum])

Figure 10: Transformation scheme for public writes

Fig. 10 shows the general transformation scheme for a partial assignment to an element of an

array of public variables x distributed in units of type [up][...] T. If the public variable x were defined inside a procedure, the library function X..WRITE.LOCAL would be used instead.

The assignment refers to a part of element [expr1][...] of x denoted by [exprp][...]. This identifies an object in the shared memory and an offset to the part to be modified. As both index constructions may contain arbitrary expressions, the symbolic source text is used to build the usual linear array access functions, which are abbreviated by elem(expr1, ...) and offs(exprp, ...) in the figure. The first is used to calculate the object number of this element of the public variable, the latter to get the offset to the part of the variable to be written. The upper bounds of dimensions $u_k$ and the size of the base type T are used as factors in the access functions. Of course, both index constructions may be missing, in which case the access functions are left out.

Besides identification of the right place in the shared memory, the data must be communicated. Integer arrays are used throughout the library for this purpose to simplify the data transfer. In the translation scheme s denotes the size in bytes, while $s' = \lceil s/4 \rceil$ is the size measured in (32-bit) integers. If s is not divisible by 4, the compiler produces the more complicated occam 2 code given for the case s mod 4 ≠ 0, which copies the data into a temporary variable. In both cases the result is retyped to an array of integers named X..Wn, where n is a generated unique number.

The third aspect of the transformation deals with internal communication of data to or from the routing process in the library. This is done by an array of channels with protocol INT::[]INT, because this is the only means to transfer data between parallel processes in occam 2.
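The access functions elem and offs and the rounding $s' = \lceil s/4 \rceil$ can be made concrete with a small Python sketch. It assumes row-major ("usual linear") indexing, offsets counted in bytes, and 4-byte INTs; the function names and the object number 0 chosen for tenrows are hypothetical, and the compiler emits these computations as occam 2 expressions rather than evaluating them itself.

```python
from math import prod

def linear_index(indices, bounds):
    """Row-major position of a (possibly partial) index list within bounds."""
    position = 0
    for k, index in enumerate(indices):
        assert 0 <= index < bounds[k]
        position += index * prod(bounds[k + 1:])
    return position

def locate(objid, outer_bounds, inner_bounds, base_size, outer_idx, inner_idx):
    """Return (object number, byte offset, size s in bytes, size s' in words)."""
    # All dimensions preceding PUBLIC must be indexed.
    assert len(outer_idx) == len(outer_bounds)
    obj = objid + linear_index(outer_idx, outer_bounds)            # elem(...)
    offset = linear_index(inner_idx, inner_bounds) * base_size     # offs(...)
    s = prod(inner_bounds[len(inner_idx):]) * base_size
    words = (s + 3) // 4                                           # ceil(s / 4)
    return obj, offset, s, words

# [10] PUBLIC [5] INT tenrows, assumed to start at object number 0.
print(locate(0, [10], [5], 4, [3], [1]))   # tenrows[3][1]: (3, 4, 4, 1)
print(locate(0, [10], [5], 4, [3], []))    # tenrows[3]:    (3, 0, 20, 5)
# PUBLIC [5] BYTE: s = 5 is not divisible by 4, so two words are transferred.
print(locate(0, [], [5], 1, [], []))       # (0, 0, 5, 2)
```

The last call shows the case in which the generated code has to copy the data into a padded temporary before retyping it to an integer array.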
The compiler assigns channel numbers of this array to the processes needing communication with the routing process.

This is solved by a bottom-up pass over the syntax tree, which calculates the number of channels $d_P$ needed by each process, followed by a top-down pass, in which the expressions chnum for each process are constructed. The following enumeration shows how this amount $d_P$, the current channel denoted by $c_P$, and the channel number to be used after this process, $c'_P$, can be calculated.

• P is a simple process like SKIP, STOP, or an assignment:
  $d_P = 1$ if OCCAM-light constructs are used, $0$ otherwise; $c'_P = c_P + d_P$
• P is SEQ $Q_1 \dots Q_n$, or IF, CASE, WHILE, ALT:
  $d_P = \max\{d_{Q_1}, \dots, d_{Q_n}\}$; $c_{Q_i} = c_P$; $c'_P = c_P + d_P$
• P is SEQ I = a FOR b Q:
  $d_P = d_Q$; $c_Q = c_P$; $c'_P = c_P + d_P$
• P is PAR $Q_1 \dots Q_n$:
  $d_P = \sum_i d_{Q_i}$; $c_{Q_1} = c_P$, $c_{Q_{i+1}} = c'_{Q_i}$; $c'_P = c_P + d_P$
• P is PAR I = a FOR b Q:
  $d_P = b \cdot d_Q$ (b is constant); $c_Q = c_P + (I - a) \cdot d_Q$; $c'_P = c_P + d_P$
• P is a procedure call Q(...):
  $d_P = d_Q$ (Q has been defined and analyzed before); $c_Q = c_P$ (part of the channel array is an additional parameter); $c'_P = c_P + d_P$
• P is a procedure definition PROC Q(...):
  $d_P = d_Q$; $c_Q = 0$ (the channel array parameters are used); $c'_P = c_P$

While the number of channels needed is constant, the assigned channel numbers may depend on PAR replicator variables and are assembled into expressions with the help of PTG. There is one special point for procedures: if a procedure uses OCCAM-light constructs, a suitable part of the channel array is introduced as an additional parameter and used for channel numbering in the procedure. This is necessary because the occam 2 definition requires the occam 2 compiler to be able to detect whether a channel could be used illegally, e.g. by two parallel processes for input.

    Source:
        [u1][...] PUBLIC [up_i][...] T_i x_i:
        ... := ... x_i[expr1][...][exprp_i][...] ...

    Translation:
        ∀i:  [s'_i] INT X..Rn_i:
        SEQ
          PAR
            ∀i:  X..READ (objid_i + (elem_i(expr1, ...)), offs_i(exprp_i, ...), s_i,
                          X..Rn_i, X..TO.RT[chnum_i], X..FROM.RT[chnum_i])
          ∀i, if s_i mod 4 = 0:
            VAL T_i X..En_i RETYPES X..Rn_i:
          ∀i, if s_i mod 4 ≠ 0:
            VAL [4s'_i] BYTE X..Bn_i RETYPES X..Rn_i:
            VAL T_i X..En_i RETYPES [X..Bn_i FROM 0 FOR s_i]:
          ... := ... X..En_i ...

Figure 11: Transformation scheme for public reads

The transformation scheme for public variable reads is shown in Fig. 11 and is mostly symmetric to writing. It may seem more complex because more than one public variable may be read in one assignment. In this case the reading library procedures are called in parallel.

For ease of compilation the prototype restricts the usage of public variables in index expressions of a written public array variable. Also, the compiler does not handle multiple assignments with public variables or the occurrence of public variables in actual parameter lists, conditional expressions, selector expressions, or channel statements. These cases could easily be implemented by schemes similar to those for assignment statements, but their absence is not a severe restriction for programmers either.
Further improvements could be achieved by optimization of public variable access similar to the differentiation between local and global publics. This is discussed further in Chapter 6.

3.4 Translation of Synchronization and I/O Extensions

The translation of synchronization and input/output extensions does not pose any new problems. They are converted to library procedure calls, which get additional channel parameters for communication with the router, just as described for assignments in the previous section. The two types of synchronization are mapped to two different library procedures; to choose between them, the number of parameters of SYNC is examined. The input/output extensions are handled similarly, the channel parameters being added after the original ones.
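The bottom-up pass of the channel numbering described earlier in this chapter can be illustrated with a small Python sketch. The tuple-based tree encoding and the names `channels_needed` and `par_offsets` are assumptions made for this illustration only and are not taken from the OCCAM-light compiler. The sketch computes the channel demand $d_P$ and, for a PAR, the first channel number of each branch.

```python
# Hypothetical mini syntax tree (not the compiler's real data structure):
#   ("simple", uses_light)   SKIP, STOP, assignment; uses_light is a bool
#   ("seq", [children])      SEQ, IF, CASE, WHILE, ALT
#   ("par", [children])      PAR Q1 ... Qn
#   ("par_rep", b, child)    PAR I = a FOR b Q, with constant b


def channels_needed(node):
    """Bottom-up pass: return d_P, the number of router channels P needs."""
    kind = node[0]
    if kind == "simple":
        return 1 if node[1] else 0
    if kind == "seq":
        # Sequential branches never run at the same time, so they can
        # reuse the same channels: take the maximum demand.
        return max((channels_needed(q) for q in node[1]), default=0)
    if kind == "par":
        # Parallel branches need disjoint channels: sum the demands.
        return sum(channels_needed(q) for q in node[1])
    if kind == "par_rep":
        return node[1] * channels_needed(node[2])
    raise ValueError("unknown node kind: %r" % (kind,))


def par_offsets(children, c_p):
    """Top-down step for PAR: c_Q1 = c_P and c_Q(i+1) = c_Qi + d_Qi."""
    offsets = []
    current = c_p
    for q in children:
        offsets.append(current)
        current += channels_needed(q)
    return offsets


read = ("simple", True)     # an assignment that touches a public variable
local = ("simple", False)   # a purely local assignment
body = ("seq", [read, local, read])
program = ("par", [body, ("par_rep", 4, body), local])

print(channels_needed(program))      # total channel demand of the program
print(par_offsets(program[1], 0))    # first channel index of each branch
```

The maximum in the sequential case and the sum in the parallel case mirror the rules of the enumeration: only processes that may run concurrently need distinct channels to the router.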

4 The Synchronization and Shared Memory Library

The interface of the runtime system of OCCAM-light consists of procedures collected in the "synchronization and shared memory library". The compiler installs these procedures in parallel to the user process on every processor. The library is implemented in occam 2 using the INMOS Occam Toolset [INM91]. No assembler code is used for optimization.

[Figure 12: Hardware configuration. Labels in the figure: Workstation, Host Transputer, additional Transputer, Butterfly of dimension n (wrap-around).]

The structure of the current hardware configuration is shown in Figure 12. The structure of all processes on one Transputer can be seen in Figure 13. All the edges in this process graph are buffered with queues. The "Routing" process, the heart of the system, contains the routing and the shared memory management parts.

4.1 Channel Communication

Because OCCAM-light is a multiparadigm language, a programmer can use all network features of occam 2. This implies that neighbour-to-neighbour link communication is possible. Therefore the compiler installs, for each of the 4 output links, a multiplex process which merges the messages from the user and the messages from the shared memory simulation into one stream. The input links are demultiplexed in the reverse manner (see Fig. 13). If there is no link communication in the user process, these multiplex/demultiplex processes are not installed. In both cases, link communication is no longer synchronizing, as it is in occam 2, because the routing kernel accepts the message from an output statement (chan ! x) immediately. If this change affects a program seriously, the occam 2 semantics can be simulated by using an output statement followed by an input statement (chan ? x) in the sending process and, in the receiving process, an input statement followed by an output statement.

4.2 Public Variables

The philosophy of occam 2, based on the CSP concept by Hoare [Hoa78], does not allow parallel processes to share variables. This limitation motivated us to implement a virtual shared memory simulation with a front end consisting of some procedures which can be used by the compiler.

[Figure 13: Processes on one Transputer. Process labels in the figure: Main.User.Proc, MUX (four), DEMUX (four), DE/MUX, Routing.]

To avoid high congestion in the routing, we use a standard shared memory simulation for networks which uses universal hashing. The compiler randomly chooses a hash function h from an appropriate universal hash class (compare e.g. [Lei92]). The shared memory simulation distributes the public variables according to h, i.e. variable x is maintained in processor $P_{h(x)}$. This achieves an almost random distribution of variables over the memory modules. These modules are realized by a large array of integers. Because Transputer networks are asynchronous, concurrent reads or writes are handled in an arbitrary order.

The hashing used in OCCAM-light does not distribute registers or every single part of a public variable but the units, i.e. the parts of public variables defined after the keyword PUBLIC (compare Section 2.1). A unit is the largest entity that can be accessed in one operation; access to parts or substructures of a unit is also possible.

The routing scheme implemented so far is the "random rank protocol" described in [Lei92]. A routing which uses shortest paths is also implemented, but there is not yet a convenient way of using an arbitrary network. Furthermore, a "restricted shortest path" algorithm, where the message flow is uni-directional in the network, is under development.

The "random rank protocol" is programmed using a strategy not typical for occam 2. It does not use parallel processes at all, but is realized in an event-driven manner, which is much faster in our case. The different steps of message processing are coupled with queues.
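The placement of public variable units by universal hashing can be sketched in Python as follows. The text does not say which universal hash class is used, so the sketch assumes the common family $h(x) = ((a x + b) \bmod p) \bmod m$ with a prime p; the names `PRIME` and `choose_hash`, the processor count of 8, and the integer unit identifiers are likewise assumptions made for illustration.

```python
import random
from collections import Counter

PRIME = 2_147_483_647  # 2**31 - 1, assumed larger than any unit identifier


def choose_hash(num_processors, rng):
    """Pick h(x) = ((a*x + b) mod PRIME) mod num_processors at random."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)

    def h(unit_id):
        # The unit with this identifier is maintained on processor h(unit_id).
        return ((a * unit_id + b) % PRIME) % num_processors

    return h


rng = random.Random(1993)      # fixed seed so that the run is repeatable
h = choose_hash(8, rng)        # 8 processors, chosen only for illustration
load = Counter(h(unit) for unit in range(2400))
print(sorted(load.items()))    # number of units maintained by each processor
```

Because a and b are drawn once and then fixed, every processor that knows them computes the same home processor for a given unit without any communication.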

Messages are copied between queues, not sent via channels.

    ALT
      elements.in(ROUTING.Queue) > 0
        ... Routing
      elements.in(SYNC.Queue) > 0
        ... Receive
        ... Send.Acknowledge
      elements.in(SHARED.MEMORY.Queue) > 0
        ... Read
        ... Write
      elements.in(TERMINATION.Queue) > 0
        ... Send TERM to all processes

Figure 14: The "Routing" process of the random rank algorithm

Whenever a message is in a queue, the ALT statement listed in Figure 14 selects one of the alternatives. Our implementation of this routing is restricted to butterfly networks.

The "shortest path routing" calculates the shortest paths between all pairs of processors in a preprocessing phase from the adjacency list of the network which is used. The result is stored on every processor in a lookup table which lists, for each processor in the network, the link to which a message has to be forwarded. So far, OCCAM-light has no network description part, which would be necessary to use arbitrary topologies.

The "restricted shortest path" routing uses the wrap-around butterfly network in one direction only and is a simple greedy algorithm. This algorithm is not yet implemented completely, but first tests show that this routing is the fastest of the three.

In Figure 15, a comparison of the three routing algorithms on a butterfly network of dimension 3 is shown for the PRAM version of the prefix algorithm introduced in Chapter 5, for different degrees of parallelization. This algorithm was chosen because it contains many accesses to public variables and a lot of synchronization.

In order to make access to public variables as efficient as possible, the library distinguishes between local and global public variables. Local public variables are shared by processes on one Transputer. They are managed on this processor, so no routing is necessary to access them. Global public variables are shared by processes on different processors. They are distributed in the way described above.

The management of shared memory on each processor is done in one array of integers.
Other types of variables are converted automatically to this type. The compiler constructs a table of indices of the public variables in this array, so that an access to a variable needs only this address and the length of the variable measured in INTs. Furthermore, all buffers, organized as FIFO queues, are implemented as circular arrays of integers. To guarantee that link communication does not block all other data flow on a processor, all channels are buffered (cf. Fig. 13), and a special protocol is used for inter-processor communication: designated request messages are sent in order to find out whether the receiver has enough space for the data. After receiving an acknowledgment from the receiver, the data is transmitted.
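The preprocessing phase of the "shortest path routing" described above can be sketched in Python. This is an illustration under assumptions, not the library's occam 2 code: the function name `next_hop_table` and the 6-node ring are hypothetical, and the table stores the neighbouring processor to forward to, whereas the library stores a link.

```python
from collections import deque


def next_hop_table(adjacency, source):
    """Return {destination: neighbour of source on a shortest path}."""
    first_hop = {}
    visited = {source}
    queue = deque()
    for neighbour in adjacency[source]:
        if neighbour not in visited:
            visited.add(neighbour)
            first_hop[neighbour] = neighbour
            queue.append(neighbour)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                # Inherit the first hop of the node we came from.
                first_hop[neighbour] = first_hop[node]
                queue.append(neighbour)
    return first_hop


# Hypothetical 6-node ring given as an adjacency list, for illustration only.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
tables = {node: next_hop_table(ring, node) for node in ring}
print(tables[0])
```

One breadth first search per processor suffices because all links have the same cost; forwarding a message then only needs a single table lookup per hop.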

[Figure 15: Comparison of the routing algorithms on a butterfly network. Plot of run time [sec] against the number of processes per processor for Random Rank, Shortest Path, and Restricted Shortest Path.]

4.3 Synchronization

In Section 2.2, we described the techniques of barrier synchronization and groupwise computing. OCCAM-light supports these concepts by means of two synchronization constructs. A simple algorithm synchronizes a subgroup of processes of known size, and a more complicated one synchronizes subgroups of unknown size. The selection between them is made by the user.

The following describes the algorithm for an unknown number of processes participating in the synchronization step.

Let d be the number of processes in the group $M'$. Each member of $M'$ knows d. $M'$ is now partitioned into an unknown number of subgroups of unknown sizes. The following algorithm works if each process is a member of exactly one new subgroup (see Figure 16).

[Figure 16: Partitioning of a subgroup. M is the current partitioning (the group of all processes); the figure shows the new partitioning of $M'$.]

[Figure 17: Visualization of the synchronization algorithm. "add 1" messages arrive at the registers $SR_1, SR_2, \ldots, SR_k$ and are passed on to the register SR of $M'$; k is the number of new groups in $M'$.]

1) For each new subgroup $M'_i$, there is a synchronization register $SR_i$. If the name of the subgroup is the integer i, then the address of $SR_i$ is the value of the hash function h(i), which can be computed by every member of $M'_i$ (each member knows i, because otherwise it could not even detect that it belongs to $M'_i$).

2) Each process belonging to group $M'_i$ sends an "add 1" message to $SR_i$. The process which administers this register initializes it with 0 and increments it by 1 whenever it receives an "add 1" message.

3) For each "add 1" message received at an $SR_i$, an "add 1" message is sent to a synchronization register SR belonging to the group $M'$, and SR is incremented by 1. The initial value of SR is 0.

4) When SR contains the known value d, it sends an acknowledgment to each $SR_i$.

5) When an $SR_i$ receives this acknowledgment, its value is the number of processes in the new subgroup $M'_i$. This value is sent to each member of this group as an acknowledgment.

6) When a process receives a synchronization acknowledgment for its subgroup, it knows the number of members in it and that all these members have reached the synchronization point.

If the number of members in a synchronization step is known for each new subgroup, say $d_i$, then each $SR_i$ can count the "add 1" messages until it reaches the value $d_i$ and then send the acknowledgments itself.

The bookkeeping information for sending the acknowledgments is organized as an array of integers, with one bit for each processor and each synchronization register. This bit is set if at least one process on that processor is involved in the synchronization with that register. If a certain ratio (e.g. 70%) of the processors have to get an acknowledgment, the messages are sent to all processors using a minimum spanning tree of the network, which is more efficient than sending an acknowledgment message to the involved processors only. In each node of the tree, the message is duplicated and sent to the successor nodes.

4.4 Input/Output

The input/output facilities described in Section 2.3 are implemented by routing messages to the host processor, which forwards them to the UNIX system. The routing is the same as for accessing public variables, so no extra effort is necessary. Again, no deterministic order of the input/output messages can be guaranteed.

5 Examples in OCCAM-light

Having described the technical details of OCCAM-light, we now give some examples of how to use our new language in very different manners. These examples show that OCCAM-light can be used for a pure PRAM programming style with a certain loss of efficiency, and how very efficient programs can be written in a comfortable way.

If there is no use of channel communication in an OCCAM-light program, the multiplex processes are not installed, so that the routing is much faster. This is typically the case if one writes PRAM algorithms in OCCAM-light.
In spite of this optimization, the performance is not very good, because in such programs public variables are accessed in almost every step, and synchronizations are needed between these accesses. On the other hand, it is very easy to translate PRAM algorithms to OCCAM-light. An example is shown in Figure 18, which is a simple PRAM algorithm for calculating the prefix sum of kN numbers on kN logical and N physical processors.

To achieve better performance, the number of accesses to public variables and the number of synchronization steps have to be reduced. For many problems, this can be done very easily. For the prefix sum computation, we have to increase the amount of input numbers. Then every processor can read a block of them from a public variable and calculate the prefix sums without any further access to public variables or synchronization. After this internal calculation, the result of the rightmost component of the block is written to an auxiliary public variable, which can be read by one distinguished processor after a synchronization. Within a local copy of this auxiliary variable, whose length is equal to the number of processors, another internal

prefix computation takes place. After a further synchronization step, the other processors can correct their elements and the computation is done. The corresponding OCCAM-light program can be found in Appendix A.

    VAL INT virt.proc IS 32:    -- number of virtual processors (a power of 2)
    VAL INT phys.proc IS 8:     -- number of physical processors
    VAL INT par.proc IS virt.proc/phys.proc:
    [virt.proc]PUBLIC INT A:    -- for each virtual processor one cell

    PROC Main.User.Proc(VAL INT N, dim, my.nr, [4]CHAN OF INT::[]INT In, Out)

      PROC Parallel.Prefix(VAL INT pid)  -- the procedure running on every
                                         -- virtual processor
        VAL INT sync.Reg IS 1:           -- name of the synchronization register
        INT r:
        SEQ
          r := 1
          WHILE r <= virt.proc
            SEQ
              SYNC(sync.Reg, virt.proc)  -- wait until one round is complete
              IF                         -- select upper halves of growing groups
                (pid \ (2*r)) >= r
                  A[pid] := A[pid] + A[(pid - (pid \ r)) - 1]
                TRUE
                  SKIP
              r := 2 * r
      :

      VAL INT init.barrier IS 0:         -- name of the synchronization register
      SEQ
        PAR i = 0 FOR par.proc           -- start of local processes in parallel
          VAL INT pid IS ((my.nr*par.proc)+i):
          SEQ
            Init.Array()                   -- initializes the input array
            SYNC(init.barrier, virt.proc)  -- wait until Init is done
            Parallel.Prefix(pid)           -- do the computation
        -- A now contains the result
    :

Figure 18: OCCAM-light program for the PRAM version of the Prefix Sum Algorithm

In this optimized parallel prefix program, the number of accesses to public variables depends on the number of processors only, and no longer on the number of inputs. The number of synchronization steps is a small constant (in fact 4). The running time, compared with the PRAM program, decreases by a factor of N and increases very slowly with respect to the number of inputs, because the internal computations are still rather small compared with the time used for communication and synchronization.

An implementation of the prefix algorithm without the use of public variables, or in occam 2, is much faster. This is no surprise, because the scheme of communication used in this algorithm

shows that nearly no information has to be routed a long way through the network; it is just sent to the neighbours of a processor, which can be done much faster without the use of public variables. Thus, such simple algorithms should be written in occam 2 if one wants to produce efficient code.

The next example demonstrates how to construct efficient programs in OCCAM-light for more complex algorithms. We want to calculate the connected components of an undirected graph. The algorithm uses the adjacency matrix as input and calculates for each vertex the number of the connected component it belongs to, i.e. the vertex with the numerically smallest label in the connected component.

The algorithm is a slight modification of that by Hirschberg et al. [HCS79] and works as follows.

Let a club be a tree graph in which all edges are incident with the root, which is the vertex with the numerically smallest label in the club (cf. [GR88]). Let an undirected graph $G = (V, E)$ with $|V| = n$ and p processors be given. Each processor manages n/p vertices of the graph and obtains a local copy of that part of the adjacency matrix which contains its vertices. The label at the end of each item points to the corresponding line of the pseudo code formulation of the algorithm, which can be found in Fig. 20.

1. At the beginning, each vertex is set to be its own club (i1).
2. Wait until all processes are ready (s1).
3. Find a neighbour which belongs to a different club up to now (s2).
4. Write the local results into a global array (s3).
5. Wait until the update is done (s4).
6. Read all local results which were written in step 4 (a1).
7. Unify the calculated clubs which belong to the same connected component into a new one. This is done with a sequential depth first search algorithm on each processor independently, so no communication is necessary in or after this step (a2).
8. Repeat steps 3–7 until no neighbours can be found any more.

Instead of unifying the clubs in a distributed computation with a subsequent communication phase, it is faster to compute the entire result on each processor (cf. step 7 of the algorithm).

Tests with several inputs show that the algorithm is stable with respect to the number of connected components in the graph. For different sizes of input, Fig. 19 shows the behavior of the algorithm on butterfly networks of dimensions 2 to 4. It also includes a comparison to a simple sequential breadth first search algorithm written in occam 2 on one Transputer.

[Figure 19: Results of the Connected Component Algorithm. Plot of running time [sec] against the number of nodes in the graph (0 to 10000) for the sequential algorithm and for the butterfly networks Bf(2), Bf(3), and Bf(4).]

    PUBLIC [n] INT global.New:
    PROC Main.User.Proc (INT my.nr ....)
      [n] PUBLIC INT New:
      [n] PUBLIC INT cc:
      SEQ
        --------------- Initialization ------------
    i1: SEQ i = 0 FOR n
          cc[i] := i
        WHILE NOT ready
          SEQ
            ------------- Searching -----------------
    s1:     SYNC (1, p)
    s2:     SEQ i = my.nr*(n/p) FOR n/p
              New[i] := "next vertex j with (i,j) is an edge and cc[i] <> cc[j]"
    s3:     [global.New FROM my.nr*(n/p) FOR n/p] := [New FROM my.nr*(n/p) FOR n/p]
    s4:     SYNC (1, p)
            ------------- Compression ---------------
    a1:     New := global.New
    a2:     "calculate all connected components in G' = (V,E'),
             where cc[i] := numerical smallest label of a vertex
             in the connected component of i
             E' consists of the following edges: (i,cc[i]),
             (cc[i],i), (i,New[i]), (New[i],i)"
    :

Figure 20: Pseudo code for the Connected Component Algorithm
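To make the pseudo code of Figure 20 more concrete, the following Python sketch runs the same rounds sequentially on one machine. It is an illustration under assumptions rather than the authors' program: the distribution over p processors and the SYNC calls are omitted, the names `make_adj` and `connected_components` are hypothetical, and a vertex without a neighbour in another club simply points to itself.

```python
def make_adj(n, edges):
    """Build a symmetric adjacency matrix from a list of edges."""
    adj = [[False] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = True
    return adj


def connected_components(adj):
    """Return cc, where cc[i] is the smallest label in i's component."""
    n = len(adj)
    cc = list(range(n))                      # i1: every vertex is its own club
    while True:
        # s2: each vertex looks for a neighbour in a different club.
        new = list(range(n))
        found = False
        for i in range(n):
            for j in range(n):
                if adj[i][j] and cc[i] != cc[j]:
                    new[i] = j
                    found = True
                    break
        if not found:
            return cc
        # a2: components of G' = (V, E') with edges (i, cc[i]) and (i, new[i]).
        helper = [set() for _ in range(n)]
        for i in range(n):
            for j in (cc[i], new[i]):
                helper[i].add(j)
                helper[j].add(i)
        label = [None] * n
        for root in range(n):                # roots are visited in increasing
            if label[root] is not None:      # order, so root is the smallest
                continue                     # label of its component
            label[root] = root
            stack = [root]
            while stack:                     # iterative depth first search
                v = stack.pop()
                for w in helper[v]:
                    if label[w] is None:
                        label[w] = root
                        stack.append(w)
        cc = label


graph = make_adj(6, [(0, 1), (1, 2), (3, 4)])
print(connected_components(graph))
```

The loop ends as soon as no vertex finds a neighbour in a different club, which corresponds to the termination condition in step 8 of the algorithm.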

6 Conclusions

We have presented a new language for parallel computers called OCCAM-light, which combines message passing with the shared memory paradigm. It uses occam 2 as a basis to express concurrency and communication by message passing and introduces public variable types for the simulated shared memory. Further additions allow barrier synchronization and input/output operations that can be used from every processor node in the network. A compiler prototype has been implemented to translate OCCAM-light programs to occam 2 code with calls to a "synchronization and shared memory library". The compiler's transformation scheme and a channel numbering technique have been explained, as well as the implementation of the two synchronization algorithms used.

Our experience is that OCCAM-light allows a convenient way of programming parallel algorithms on Transputer networks. The efficiency of the produced code depends on the use of synchronizations and of communication via the shared memory library. Of course, algorithms making use of neighbour communication in an appropriately chosen topology can be programmed in occam 2 much more efficiently. But the loss of efficiency is tolerable for algorithms exchanging blocks of data in a non-regular way or for algorithms which are very difficult to program with message passing only.

The prototype implementation is to be improved in several ways. Some of the technical restrictions mentioned above will be lifted. The speed of the routing will be increased significantly by changing the routing algorithm. The procedural notation for synchronizations should be improved by supporting typical usage of barrier synchronization and grouping with new language features. Also, optimization of data access can be added for public variables used by subgroups of processors only. Our aim is to find more usable and convincing solutions for parallel programming than pure PRAM or message passing models.

Acknowledgements

We are grateful to Uwe Kastens and Friedhelm Meyer auf der Heide for many helpful discussions and remarks. Furthermore, we owe special thanks to Frank Nennecker, who has done most of the implementation work.

This work was supported by the Deutsche Forschungsgemeinschaft under the title "Forschergruppe Effiziente Nutzung massiv paralleler Systeme", by Esprit Basic Research Action No. 7141 (ALCOM II), and by the Volkswagenstiftung.

References

[All91] James Allwright. The WP6 BSP-Occam library. PUMA Working Paper 20, University of Southampton, March 1991.

[And91] Gregory R. Andrews. Concurrent Programming: Principles and Practice. The Benjamin Cummings Publishing Company, 1991.

[GHL+92] R.W. Gray, V.P. Heuring, S.P. Levi, A.M. Sloane, and W.M. Waite. Eli: A complete, flexible compiler construction system. Comm. of the ACM, 35(2):121–131, 1992.

[GR88] A. Gibbons and W. Rytter. Efficient Parallel Algorithms. Cambridge University Press, 1988.

[HCS79] D.S. Hirschberg, A.K. Chandra, and D.V. Sarwate. Computing connected components on parallel computers. Comm. of the ACM, pages 461–464, 1979.

[Hoa78] C.A.R. Hoare. Communicating sequential processes. Comm. of the ACM, 21(8):666–677, 1978.

[INM88] INMOS Limited. occam 2 Reference Manual. Prentice Hall, 1988.

[INM91] INMOS Limited. occam 2 toolset Manual, March 1991.

[Juv92] Simo Juvaste. The programming language pm2 for PRAM. Report Series B B-1992-1, University of Joensuu, Dep. of C.S., 1992.

[Kas91] Uwe Kastens. Attribute grammars in a compiler construction environment. In H. Alblas and B. Melichar, editors, Attribute Grammars, Applications and Systems, volume 545 of LNCS, pages 380–400. Springer Verlag, 1991.

[KW91] U. Kastens and W.M. Waite. An abstract data type for name analysis. Acta Informatica, 28:539–558, 1991.

[Lei92] F.T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays - Trees - Hypercubes. Morgan Kaufmann, 1992.

[PT91] M. Philippsen and W.F. Tichy. Compiling for massively parallel machines. In Proc. Workshop on Code Generation, Dagstuhl Castle, 20-24 May 1991, WICS. Springer Verlag, 1991.

[Sei90] Thomas Seifert. Spezifikation einer Sprache zur Simulation von PRAM-Modellen und ihre Übersetzung nach OCCAM. Diploma thesis, Universität Dortmund, 1990.

[Val89] L.G. Valiant. Bulk-synchronous parallel computers. In M. Reeve and S.E. Zenith, editors, Parallel Processing and Artificial Intelligence, pages 15–22. Wiley, 1989.

A Appendix

Efficient OCCAM-light program for the Prefix Sum Algorithm.

    VAL INT proc IS 8:         -- number of processors
    VAL INT l IS 1000:         -- block length
    VAL INT n IS l*proc:       -- length of the input
    [proc]PUBLIC [l]INT A:     -- contains the input data and also returns the result
    PUBLIC [proc]INT top.one:  -- auxiliary public variable

    PROC Main.User.Proc(VAL INT N, dim, id, [4]CHAN OF INT::[]INT EIN, AUS)
      #USE "string.lib"        -- INMOS Occam Toolset Library
      VAL INT barrier IS 0 :   -- name of the synchronization register
      [l]INT a.local :         -- contains a local copy of a block of A

      PROC pref.internal([]INT b)  -- procedure for the internal prefix calculation
        INT len, last.one:
        SEQ
          len := SIZE b
          last.one := b[0]
          SEQ k = 1 FOR len-1
            SEQ
              last.one := b[k] + last.one
              b[k] := last.one
      :

      SEQ
        -- initialization
        init(a.local, id)            -- initialization
        A[id] := a.local
        SYNC(barrier, N)             -- wait until all processors are ready
        -- solve within each processor
        a.local := A[id]             -- make a local copy of a block of A
        pref.internal(a.local)       -- internal calculation of the prefix sum
        top.one[id] := a.local[l-1]  -- write into the public variable top.one
        SYNC(barrier, N)
        -- solve using the local algorithm for top elements in proc. 0
        IF
          id = 0
            [proc]INT top.one.local:
            SEQ
              top.one.local := top.one
              pref.internal(top.one.local)
              top.one := top.one.local
          TRUE
            SKIP
        -- copy top.one back to A
        SYNC(barrier, N)
        a.local[l-1] := top.one[id]
        A[id][l-1] := a.local[l-1]
        SYNC(barrier, N)
        -- correct other elements
        IF
          id = 0
            SKIP
          TRUE
            -- correct elements
            INT last.sum :
            SEQ
              last.sum := A[id-1][l-1]
              SEQ j = 0 FOR l-1
                a.local[j] := a.local[j] + last.sum
        A[id] := a.local             -- write the result back to A
        -- A now contains the result
    :  -- Main.User.Proc

Figure 21: Efficient OCCAM-light program for the Prefix Sum Algorithm
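The three phases of the program in Figure 21 can be checked with a short sequential Python sketch. It is only an illustration of the blocked scheme under simplifying assumptions: the blocks are processed one after another instead of on separate processors, there are no public variables or SYNC calls, and the function name `blocked_prefix_sum` is hypothetical. Every block is assumed to be non-empty.

```python
from itertools import accumulate


def blocked_prefix_sum(blocks):
    """Prefix sums over the concatenation of the given non-empty blocks."""
    # Phase 1: every "processor" computes the prefix sums of its own block.
    local = [list(accumulate(block)) for block in blocks]
    # Phase 2: one processor computes prefix sums over the block totals,
    # i.e. over the rightmost element of each local result.
    top_one = list(accumulate(block[-1] for block in local))
    # Phase 3: every block except the first adds the total of all blocks
    # to its left.
    for pid in range(1, len(local)):
        last_sum = top_one[pid - 1]
        local[pid] = [value + last_sum for value in local[pid]]
    return local


data = list(range(1, 13))
blocks = [data[i:i + 4] for i in range(0, len(data), 4)]
result = blocked_prefix_sum(blocks)
print(result)
```

Only the block totals cross block boundaries, which is why the number of public variable accesses in the OCCAM-light program depends on the number of processors and not on the input length.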
