

Future Generation Computer Systems 7 (1991/92) 231-245 231 North-Holland

The connection machine opportunity for the implementation of a concurrent functional language

Claudio Bettini a and Luca Spampinato b

a Computer Science Department, University of Milano, via Comelico, 41, 20135 Milano, Italy
b QUINARY S.p.A., via Crivelli, 15/1, 20122 Milano, Italy

Abstract

Bettini, C. and L. Spampinato, The connection machine opportunity for the implementation of a concurrent functional language, Future Generation Computer Systems 7 (1991/92) 231-245.

A concurrent functional language is presented, Parlam, which includes primitives for explicit control of parallel reductions. An abstract machine for Parlam (ParSEC) is derived from Landin's SECD machine by extending it for dealing with parallel computation. Several issues and problems about its implementation on a massively parallel computer are discussed and a specific implementative model for ParSEC under the Connection Machine is described.

Keywords. Concurrent languages; functional languages implementation; lazy evaluation; massively parallel architectures.

1. Introduction

Functional languages present a set of interesting features for the analysis of concurrent system programming. We believe that they offer, at the same time, a number of important opportunities for exploiting the capabilities of modern massively parallel hardware.

Parallel computations can arise in the context of functional languages in at least two ways. First, there is a kind of implicit parallelism: distinct parts of the same program can always be evaluated in parallel [1]. Second, there is the possibility of making these languages support explicit parallelism by means of suitable functions for the creation, execution and synchronization of parallel computations.

Among several approaches to the definition of such functions and languages (in particular refs. [4,6]), we have focused on a particular language, Parlam, defined first in refs. [11,12] and then revised in ref. [3]. This language has a few powerful primitives for explicit control in the parallel reduction of expressions.

In this paper the matching between the Parlam approach to parallelism and the Connection Machine (CM) features is analyzed.

Sections 2 and 3 describe the language, its parallel primitives and a computational model for its interpretation. Section 4 touches on a previous implementation on a serial computer, whereas Section 5 is dedicated to the implementation under the CM. Complete syntax and semantics are reported in Appendix A.

2. Parlam

Parlam (PARallel LAMbda) is a functional programming language with explicit concurrency control. In functional languages, the basic idea is the application of functions (expressions) to arguments (other expressions). They are usually derived from the lambda-calculus tradition and are completely side effect free. Parlam was introduced by Prini [11,12], then revisited and implemented at the University of Milano in the framework of the research on functional languages.

0376-5075/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved


Parlam syntax (and fundamentally its "spirit") is similar to Landin's A.E. (Applicative Expressions) [9,10]. Programming is based on higher order functions both for data definitions and for control structure abstraction. The set of kernel primitives is intentionally kept small and simple to allow a clearer identification of the most fundamental computational issues. Parlam's formal interpreter consists of a term rewriting system derived from the SECD machine. The evaluation of a Parlam expression can be viewed as the reduction of the expression by means of the application of rewrite rules.

The most interesting Parlam feature is the ability to explicitly synchronize reductions of the argument expressions and the function's body expression through the use of the language primitives Init and Wait.

Basically, the reduction of an argument expression is started as soon as the corresponding parameter is involved in the reduction of the function's body expression (parallel call-by-need). The use of synchronization primitives can influence and control this reduction activity. In Fig. 1 an example of a typical function application is shown.

This program gives the definition of a simple version of the cons primitive (dotted pair constructor) and performs its application to the arguments "a" and 1 + 3. A suspended reduction activity (A1) is assigned to the cons definition (lambda expression) and it is bound to the reference "cons" in the application environment (activity A0). The evaluation of the cons application activates A1 and causes the construction of a suspended reduction activity for each argument (A2 and A3). The application of the result of A1 (a closure) to its arguments binds head to A2 and tail to A3 in the environment and evaluates the cons body. This causes the activation of A2 and, after its completion, the creation of the closure in Fig. 2.

This closure is a pair structure, able to return proper values if applied to "CAR", "CDR" or "NULL" arguments. The Wait primitive requires the A2 activity to complete before going on with the expression evaluation. The A3 activity, on the other hand, is kept suspended until the resulting closure is applied to a "CDR" argument and an access to the value of the parameter tail is performed, causing its activation.

Let cons = Lambda head tail.
             Wait head In
               Lambda msg.
                 If Same msg As "CAR" Then head
                 Else If Same msg As "CDR" Then tail
                 Else If Same msg As "NULL" Then False
                 Else Error
In cons("a", 1 + 3)

[Diagram: activity A0 (the application environment) creates A1 for the cons definition and suspended activities A2 and A3 for the arguments.]

Fig. 1. The application of the cons function and the reduction activities for its evaluation.

The primitive Init, not used in this example, explicitly activates a suspended reduction activity associated with a specific reference given as argument. In Parlam, activated reduction activities are first class objects and can also be returned as values of applications.

There are two other primitives that can be used to synchronize different reduction activities:
• Over An - It is false if its argument is a reduction activity (it tests for termination).
• Either An Am - It starts two reduction activities and returns the value of the one that terminates first.
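These two primitives have close analogues in Python's concurrent.futures, which can serve as a rough operational sketch (our names; not Parlam semantics in full, since Parlam activities are reductions, not threads):

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
import time

pool = ThreadPoolExecutor()

def over(future):
    """Analogue of Over An: False while the argument is still a running
    activity, True once it has terminated."""
    return future.done()

def either(thunk_a, thunk_b):
    """Analogue of Either An Am: start both activities and return the
    value of the one that terminates first."""
    futs = [pool.submit(thunk_a), pool.submit(thunk_b)]
    done, _ = wait(futs, return_when=FIRST_COMPLETED)
    return done.pop().result()

fast = lambda: "fast"
slow = lambda: (time.sleep(0.3), "slow")[1]
print(either(fast, slow))              # -> fast
```

The And function of Fig. 3 below is essentially a hand-rolled Either built from Over and a polling loop.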

An operational semantics is defined for Parlam. It is specified by the definition of an interpreter in terms of transitions for an abstract machine (ParSEC), directly derived from Landin's SECD [9,5]. The state of each reduction activity is represented by (symbolic) values in a set of registers. These registers contain all the information the activity needs to perform the evaluation of the assigned expression. In a pure functional programming language, evaluations of sub-expressions are independent and cannot influence each other. Even if Parlam cannot be considered fully pure 1, reduction activities can proceed independently, apart from explicit synchronization. The state of the whole computation is represented by the list of the states of all the activated activities. The computation is performed by applying, in parallel, state transition rules. The transitions are reported in Appendix A.

CODE: [ Lambda msg.
          If Same msg As "CAR" Then head
          Else If Same msg As "CDR" Then tail
          Else If Same msg As "NULL" Then False
          Else Error ]
ENV:  [ head --> "a"   tail --> A3 ]

Fig. 2. Final value from the evaluation of the program in Fig. 1.

Parlam synchronization capabilities can be used to create a large set of interesting parallel structures (see the example in Fig. 3). As a further example 2 of using parallel evaluation and functional data structures (e.g. functional integers) consider the following functions:

Zero = Lambda Message.
         If Same Message As ZERO Then True

And = Lambda exp1 exp2.
        Letrec
          poll = Lambda.
                   If Over exp1 Then
                     If exp1 Then exp2 Else False
                   Else If Over exp2 Then
                     If exp2 Then exp1 Else False
                   Else poll()
        In
          Init exp1 In Init exp2 In poll()

Fig. 3. The And function evaluates its arguments in parallel and returns False as soon as one of them returns False; if the first one terminating returns True, the And final value is the value of the second evaluation.

Add = Lambda Addendum Augendum.
        Init Augendum In
          If Addendum(ZERO) Then
            Wait Augendum In Augendum
          Else
            Add(Addendum(PRED), PseudoSucc(Augendum))

PseudoSucc = Lambda Pred.
               Lambda Message.
                 If Same Message As PRED Then Pred
                 Else If Same Message As ZERO Then False

PseudoSucc returns a closure whose Env contains a binding between the name Pred and the argument fed to PseudoSucc. When a successor is sent the message PRED, it returns the value bound to Pred, which had been frozen in Env at the time the closure was created. The prefix Pseudo indicates that the function does not wait until Pred is a final value before returning it: it can return an activity that will eventually produce a value.

1 This is essentially due to the Over primitive, which requires access to a global state.

2 This example is taken from ref. [11].

Add allows the execution of the various calls to PseudoSucc to overlap. This may imply that some predecessor of the final val may not itself be final (it can still be an activity) when val is returned, but it will eventually become final.
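The pipelining idea can be sketched in Python by letting a pseudo-successor hand back a possibly still-running future on a PRED message. This is our own illustrative encoding of the functional-integer scheme above, not the Parlam runtime:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()

# Functional integers as message-driven closures (after Zero/PseudoSucc)
def zero(msg):
    if msg == "ZERO":
        return True

def pseudo_succ(pred):
    """Return a successor closure immediately; `pred` may still be a
    running future when it is handed back on a PRED message."""
    def succ(msg):
        if msg == "PRED":
            return pred                # possibly not yet a final value
        if msg == "ZERO":
            return False
    return succ

def force(n):
    """Resolve a number that may still be a future."""
    return n.result() if hasattr(n, "result") else n

def add(addendum, augendum):
    """Like Add: build the result's successors without waiting for each
    predecessor to be final, overlapping the pseudo_succ calls."""
    addendum = force(addendum)
    if addendum("ZERO"):
        return force(augendum)
    return add(addendum("PRED"), pseudo_succ(pool.submit(force, augendum)))

def to_int(n):                          # fully force a number, for checking
    n = force(n)
    return 0 if n("ZERO") else 1 + to_int(n("PRED"))

two = pseudo_succ(pseudo_succ(zero))
three = pseudo_succ(two)
print(to_int(add(two, three)))          # -> 5
```

As in the text, the value returned by add may contain non-final predecessors (futures) that become final only when traversed.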

Fibonacci = Lambda Index.
              If Index(ZERO) Then Zero
              Else If (Index(PRED))(ZERO) Then PseudoSucc(Zero)
              Else Add(Fibonacci(Index(PRED)),
                       Fibonacci((Index(PRED))(PRED)))

The time needed to compute a Fibonacci number is a linear function of Index (in fact Add computes its two arguments in parallel), while it is an exponential function of Index if a sequential Add is used.

More Parlam programming examples are available in refs. [11] (flight reservation, parallel data structure constructors, etc.), [3] (a parallel database model, etc.) and [2] (tautology evaluation, etc.).


3. A computational model for Parlam

A Parlam interpreter can be based directly on its operational model. The fundamental idea is to organize the representation of the on-going computation as a graph in which the nodes are the current state representations of the reduction activities. The edges of the graph are references among activities (e.g. there will be an edge between an activity waiting for a value and the one that is supposed to produce the value). Conceptually, possible reductions are applied in parallel to the graph's nodes; each transition can create new activities and/or transform their state. This, in turn, can add or delete references to other nodes. When a node has no references from other nodes in the graph, it is irrelevant for the current computation and can be physically removed [6]. In this way, during computation, the graph continuously grows and shrinks.
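The bookkeeping just described can be sketched with a toy graph structure. The representation below is illustrative only, not the ParSEC encoding:

```python
# Nodes are reduction activities, edges are references among them; nodes
# unreachable from the root are irrelevant and can be physically removed.
class Node:
    def __init__(self, name):
        self.name = name
        self.refs = []          # edges to activities this node depends on

def reachable(root):
    seen, stack = set(), [root]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(n.refs)
    return seen

def prune(nodes, root):
    """Drop nodes that no other node references via the root."""
    live = reachable(root)
    return [n for n in nodes if n in live]

root, a, b, dead = Node("A0"), Node("A1"), Node("A2"), Node("A3")
root.refs = [a]; a.refs = [b]          # A3 has lost all its references
print([n.name for n in prune([root, a, b, dead], root)])  # -> ['A0', 'A1', 'A2']
```

Transitions would add and delete refs entries, making the graph grow and shrink between prune passes.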

Each node contains the values of several registers of a simple virtual stack machine, the ParSEC machine (Parallel SEC). The idea is that a computation is spread across a set of SECD machines. The ability to start independent computations mapped on new instances of the register tuple (a different SECD machine) removes the need for the Dump register. The relevant registers are shown in Fig. 4.

State can take the values: TRAV (activated reduction activity), SUSP (suspended activity), COMP (completed activity), WAIT (waiting for the termination of another activity), EITH (waiting for two competing activities). On Stack are stored the intermediate results of the computation, as well as the final one when the activity completes. Env contains an association list representing the local environment of the reduction activity, namely the associations between the variables in the expression under reduction and the related activities or values. Control contains a compiled form of the Parlam sub-expression under reduction. References is used to store the synchronization references among the activities.
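The five-register tuple can be written down as a record. The field types below are our illustrative choices; the paper leaves the concrete encoding open:

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):        # the five activity states named in the text
    TRAV = "trav"; SUSP = "susp"; COMP = "comp"; WAIT = "wait"; EITH = "eith"

@dataclass
class ParSECActivity:
    """One reduction activity's registers (a sketch)."""
    state: State = State.SUSP
    stack: list = field(default_factory=list)        # intermediate/final results
    env: dict = field(default_factory=dict)          # variable -> activity or value
    control: list = field(default_factory=list)      # compiled sub-expression
    references: list = field(default_factory=list)   # synchronization refs

a = ParSECActivity(state=State.TRAV, control=["LDC", 42])
print(a.state.name, a.control)         # -> TRAV ['LDC', 42]
```

Spawning an independent computation then means allocating a fresh tuple, which is why no Dump register is needed.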

Each activity works on local data by applying a set of transformation rules which implement the rewrite rules defining the operational semantics of Parlam. It is important to note that each process is given all the information relevant to its assigned computation.

In this model, a Parlam program is evaluated starting from a single activity with State = TRAV, Control = the compiled version of the Parlam program, and Env = Stack = References = empty. This activity creates a graph during the evaluation (for instance by creating a new activity for each argument in a function application). The graph grows and shrinks due to transitions enabled by TRAV activities. The end of the computation is determined by the collapsing of the graph into a single node with State = COMP and Stack = the value of the whole computation. Fig. 5 sketches an example of a transition during graph transformation.

A crucial choice for the implementation of this model concerns the synchronization of the completed activities and the activities which need the produced value, i.e. how the graph will shrink. This problem can be faced following two basically different approaches:

(1) The "consumer" activities have a reference to the "producer" and access it when they actually need the value. When a "consumer" is simply waiting for a value, it continuously tests the completion of the "producer" (polling).

(2) The "producer" activity has a reference to the related "consumers" and accesses them, communicating the final value, as soon as it is available. It also has to wake up all the waiting activities (value propagation).

Another fundamental choice concerns the behavior of the resource allocator/collector, which must be able to deal with data structures as well as with the activities, following [6] 3.

| State | Stack | Env | Control | References |

Fig. 4. The five registers of the ParSEC abstract machine.

3 Activities no longer involved in the computation (not reachable from the graph's top node) have to be collected as "garbage" data structures.


[Figure: the activity graph before and after a transition; nodes carry their State (TRAV, SUSP, WAIT1, COMP) and Control code, with a TOP pointer at the graph's root.]

Fig. 5. Graph transformation during execution of the application cons("a", "b"), where cons is defined as in Fig. 1. Synchronization is done through polling. We can note these transitions: P2 - instruction AP: it doesn't find a closure in Stack[1]; change of state to WAIT1. P3 - instruction LDF: it creates a closure and puts it in Stack[1].

4. Implementation on a serial computer

Since 1984, we have had an implementation of a polling based model realized on a serial computer 4 at the University of Milano [3]. Parallel application of the transition rules is simulated with a scheduling mechanism.

The implementation is based on a copying garbage collection which manages data structures as well as activities. As any other copying GC, it traverses the graph structure representing the state of the computation, but it is also in charge of: (i) calling the evaluator for TRAV activities; (ii) polling the completion of activities waited for and waking up the WAIT and EITH activities; (iii) collapsing chains of COMP activities and moving values. The activities no longer relevant for the computation (even if TRAV) are simply collected away as garbage.

4 It is implemented using the C language under a minicomputer running Unix.

This implementation validates the proposed computational model; moreover, it is a good tool for studying the graph's evolution during computation.

Several statistical evaluations have been done and we have obtained a reasonable estimate of the kind and degree of parallelism that it is possible to achieve in this framework. The considered model is concurrency control distributed and asynchronous [8]. The low number of instructions (sometimes only 3-4) executed by an active process with no need for communication allows us to classify this model as one with fine granularity, suggesting that higher priority be given to communication channels in a hardware implementation. The communications geometry is strictly dependent on the program being executed. This makes the edges of the graph change dynamically during the computation.

5. Studying an implementation under the Connection Machine

The Connection Machine 5 (which we designate by CM) has many features surprisingly suitable for a parallel implementation of the described model. The CM is a very fine granularity parallel computer (up to 64K processors) with a hypercube connection topology among pools of processors. It is a logic-in-memory architecture which does not physically separate the processors from the main memory. Each elementary processor is a variable length operand ALU and has a 64 Kbit memory. Each group of 16 processors is connected to the same router, which links them to the rest of the system via a 12-cube network. This network topology and on-chip routers allow efficient communication between any pair of processors. This makes it easy to define and use any virtual connection scheme among processors. The machine is SIMD (single instruction multiple data): each processor executes the current instruction broadcast to all processors. Individual processors can be instructed to mask out certain instructions, so that it is possible to assign certain computations to certain processors.

The fine granularity matches the model's characteristics; in particular, it matches the low computational activity demanded of each process created during the computation. On the other hand, the dynamic structure of our activities' graph finds a direct architectural correspondence in the CM connection net, which is general, easy to modify and relatively fast.

5 The term Connection Machine is a registered trademark of Thinking Machines Corporation. We refer to the CM-2 model. Hardware specs may be changed in newer models.

The computational model presented in section 3 can be directly implemented on this architecture. The idea is to assign a CM-cell (processor and memory) to each activity. The actual topology of CM-cell connections will reflect the edges of the computation graph. Each processor is dedicated to the evaluation of an expression that is part of the whole program. Dependences among expressions and synchronization points are represented by the graph. The computation consists of the evolution of this graph: nodes (CM-cells) can concurrently generate new nodes (allocating new CM-cells) for the evaluation of sub-expressions, expanding in this way the graph, or they can complete their activity, causing a subgraph to collapse (freeing the corresponding CM-cells). In a serial implementation the computation proceeds scanning the graph node by node and performing single transitions sequentially. With the CM implementation we have the opportunity to visit every node of the graph at the same time. Actually, we could apply concurrently all the applicable transitions. This means that each processor, based on its memory state, can execute a transition if it is subject to one. In practice, due to the CM SIMD nature, only the set of processors (activities) subject to the same transition (reduction) will really work in parallel. What we do is actually a MIMD (multiple instruction multiple data) simulation under a SIMD machine, consisting of a selection loop over sets of processors subject to the same transition. Considering the very fine granularity and the large average number of elements in each set, this simulation does not involve a relevant loss of efficiency.
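The shape of this selection loop can be sketched as follows. The sequential inner loop stands in for the SIMD lockstep execution of one broadcast instruction over the selected set; names and the transition table are illustrative:

```python
# Sketch of the MIMD-on-SIMD simulation: in each cycle, every set of
# activities enabled for the same transition is selected and executes
# that transition "in lockstep".
def simulation_cycle(activities, transitions):
    """transitions: {name: (predicate, action)}; each action is applied
    to the whole selected set, mimicking one broadcast instruction."""
    for name, (enabled, action) in transitions.items():
        selected = [a for a in activities if enabled(a)]   # the active set
        for a in selected:                                 # lockstep step
            action(a)

acts = [{"state": "trav", "pc": 0}, {"state": "wait", "pc": 5},
        {"state": "trav", "pc": 2}]
transitions = {
    "step-trav": (lambda a: a["state"] == "trav",
                  lambda a: a.update(pc=a["pc"] + 1)),
    "poll-wait": (lambda a: a["state"] == "wait",
                  lambda a: None),       # polling: nothing ready yet
}
simulation_cycle(acts, transitions)
print([a["pc"] for a in acts])           # -> [1, 5, 3]
```

Efficiency rests on each selected set being large on average, so that the per-set broadcast amortizes well.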

5.1. The data structures

*Lisp is a parallel version of Common Lisp designed to exploit data level parallelism on the CM [13]. In the following we refer to its basic data and control structures, since it is a common reference for CM programming. However, an efficient Parlam implementation should be based on a lower level language. The basic data structure is the pvar (parallel variable). A pvar is similar to a vector, with each component of the vector stored in a different processor. Each component of a pvar is of identical size and is stored at the same relative address within each processor's memory. We consider a pvar abstraction, assuming its components can have complex data types (for example string and list).
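The pvar idea, one component per processor with masked element-wise operations, can be mimicked with a small Python class. This is a toy analogy of the *Lisp style, not its API:

```python
# A toy pvar: one component per (simulated) processor, with masked,
# element-wise operations in the spirit of *Lisp's *when.
class Pvar:
    def __init__(self, values):
        self.v = list(values)
    def where(self, mask, f):
        """Apply f only on processors selected by mask."""
        self.v = [f(x) if m else x for x, m in zip(self.v, mask)]
        return self
    def eq(self, const):
        return [x == const for x in self.v]

state = Pvar(["trav", "wait", "trav", "comp"])
pc    = Pvar([0, 7, 3, 9])
# advance the program counter only in processors whose state is "trav"
pc.where(state.eq("trav"), lambda x: x + 1)
print(pc.v)                              # -> [1, 7, 4, 9]
```

The Cell-type, State and Stack fields described next are exactly such per-processor pvars.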

We choose to assign special CM-cells for the representation of Environment and Control since these are shared structures. A globally defined Cell-type pvar identifies processors as being of type activity, environment or control. Let us see how each processor's memory is structured, depending on its Cell-type:

• Activity-cell.
| Cell-type | State | Stack | Pe | Pc | Refs | Coll |
The State pvar stores the activity's current state (TRAV, SUSP, WAIT, EITH or COMP); the Stack pvar is a fixed length list of locations in each processor's memory for temporary results; these can be Activity-cell numbers (processor cube-addresses) or Parlam basic values 6. The next two pvars store pointers to the activity's Env-cell and Control-cell respectively. Refs is a field whose structure depends on the synchronization policy and will be discussed in the following section. Some bits are reserved in the Coll pvar for garbage collection.

• Env-cell. Activities often inherit their environment from their creator, and even closures can share the same environment. Having a separate cell for the environment is memory saving. This cell has a Cell-type pvar and the remaining memory is a Values pvar whose elements can be Activity-cell addresses or Parlam basic values. It is not necessary to store an association list in Env. The compiler takes care of translating references to indexes, so that values in Env are retrieved by index.

• Control-cell. When the computation starts, the whole compiled Parlam program has to be accessible from the starting activity. It is stored in several Control-cells according to the structure of the compiled code. As soon as it is created, an activity can access its code in one step. Also in this case we have two pvars: Cell-type and List. List components, in this case, are instruction codes, numbers or references to other Control-cells where sub-expression code is stored. For example, when A1 creates A2 for an argument evaluation, it does not have in its Control-cell the code for the argument, but a pointer to another Control-cell that has it, so that A1 can give A2 the exact Control-cell's cube-address with the argument code.

6 Parlam basic values are numbers, strings and closures.

5.2. Scheduling active sets

In the MIMD simulation, control is given in turn to these sets of cells, one at a time. Moreover, the Activity-cells will be scheduled according to their State, and those Activity-cells in state TRAV according to the transition enabled to fire.

The firing of a transition modifies the pvars' content in the Activity-cells in the currently active set and in their Env-cells, and possibly it creates new Activity-cells or wakes up suspended ones. A brief sketch of how this simulation could be coded is given in Fig. 6.

An important issue is that of idle resource recovery. In the serial implementation we use a copying algorithm [6]. Due to synchronization problems in implementing this algorithm under the CM, a reasonable solution is to use a parallel version of traditional reference counting or mark & sweep. For new cell allocation it is desirable, instead of using the traditional free list, to use algorithms that optimize connection locality, such as the Wave of requests algorithm [7].

5.3. The synchronization policy

During the computation we need waiting activities to get the values they need as soon as they are produced, but we also need values in Stacks or in Env-cells that were initially references to SUSP or TRAV activities to be updated as soon as these activities complete their evaluation. As mentioned in section 3, there are two basically different approaches to synchronization: polling and value propagation.

5.3.1. The polling approach

Here the basic operation is that of tracing; it consists of polling on activities whose value is being waited for and jumping over chained activities in COMP state, until a value or the address of an activity not yet COMP is found (see Fig. 7).


(LOOP WHILE (NOT (value-p top))
  ;;; repeat the MIMD simulation cycle until the final value is produced
  (*WHEN activity-cell-p                      ;;; only activity-cells are selected
    (*WHEN (=!! State (!! "trav"))            ;;; only activity-cells in state TRAV
      (LOOP FOR x IN Stack DO
        (*IF (pointer-p!! x)
             (*SET x (trace!! x))))           ;;; Stack has been traced
      (*LET ((current-instr (pref!! program-counter Pc)))
        ;;; based on the next instruction, pools of processors
        ;;; execute the corresponding code
        (*WHEN (=!! current-instr (!! "LDC"))
          (*InsertStack (pref!! (1+ program-counter) Pc))
          (*SET program-counter (2+ program-counter)))
        (*WHEN (=!! current-instr (!! "LD"))
          ...)
        (*WHEN (=!! current-instr (!! "AP"))
          (*IF (closure-p!! (Getfromstack!! Stack 1))
               (*LET ((n-pars (pref!! (1+ program-counter) Pc))
                      (env1 (CreateEnv!! n-pars Stack Pc))
                      ;;; a new Env is created whose first n locations
                      ;;; are the arguments and the rest is inherited
                      (newact (CreateAct!! (!! "trav") nil!! env1
                                (Closure-code!! (Getfromstack!! Stack 1)))))
                 ;;; a new TRAV activity is created for the actual function
                 ;;; application while the parent activity goes on with
                 ;;; the computation
                 (*DelStack (1+ n-pars))
                 (*InsStack newact)
                 (*SET program-counter (2+ program-counter)))
               ;;; ELSE, the first Stack location is not a closure
               (PROG
                 (*SET State (!! "wait1"))
                 (*SET son1 (Getfromstack!! Stack 1))
                 (*DelStack 1))))))
    (*WHEN (=!! State (!! "wait1"))
      ;;; activity-cells waiting for a value are selected
      (*SET son1 (trace!! son1))
      (*IF (value-p!! son1)
           (PROG (*SET State (!! "trav"))
                 (*InsStack son1)
                 (*SET son1 nil!!)))))
  (*WHEN env-cell-p                           ;;; Env-cells trace address values
    (LOOP FOR val IN Values DO
      (*IF (pointer-p!! val)
           (*SET val (trace!! val)))))
  (trace top))                                ;;; at the end of each cycle, top is traced

Fig. 6. *Lisp pseudo-code for the MIMD simulation cycle for parallel reductions, based on the polling approach.


| WAIT | --> | COMP | --> | COMP | --> ...

Fig. 7. Tracing of COMP chains.

Following this approach, Refs in the data structure of Activity-cells consists of two pvars: Son1 and Son2. Each one can store a reference to an activity 7 waited for.

When Activity-cells in states WAIT, COMP and EITH get control, they trace their Sons to check if the values waited for have been produced. Activity-cells in state TRAV trace pointers in their Stack pvar, and Env-cells trace pointers to Activity-cells. Notice that this tracing operation is done in parallel on activities in the same state and on Env-cells. Tracing causes a reorganization of connections (references) among CM-cells. A particular pointer, top, is traced at each (MIMD simulation) cycle; it points to the actual graph's root. When (trace top) returns a Parlam basic value, that value is the final value of the computation.
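The tracing operation itself is a simple pointer-chasing loop, sketched here over an illustrative table of cells (our encoding, not the CM memory layout):

```python
# Tracing, as in Fig. 7: follow chained COMP activities until a basic
# value or a not-yet-completed activity is reached.
def trace(cell, cells):
    while isinstance(cell, int):             # int = an activity address
        state, content = cells[cell]
        if state != "COMP":
            return cell                      # still running: keep the address
        cell = content                       # COMP: jump to its result
    return cell                              # a basic value

cells = {0: ("COMP", 1),     # activity 0 completed with a reference to 1
         1: ("COMP", "a"),   # activity 1 completed with a basic value
         2: ("TRAV", None)}  # activity 2 is still reducing
print(trace(0, cells), trace(2, cells))      # -> a 2
```

On the CM, each jump in this loop is a remote read through the router, which is what makes polling expensive when values are not yet ready.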

5.3.2. Value propagation

Polling has a main drawback: tracing causes many reading operations that do not contribute to the advancement of the computation (often the value waited for has not yet been produced). In the value propagation approach, each activity completing its evaluation instead propagates its final value to every CM-cell that needs it.

Following the value propagation approach, the Refs field in Activity-cells' data structure has two list pvars: a Msg-arrival pvar for the arrival of values (either actually waited for or for Stack updating) and a Waited-by list pvar for references to Activity-cells or Env-cells that need the evaluation result once it becomes available. The Waited-by pvar can also be used as a reference counter for garbage collection. We give in the following an informal description of CM-cells' behavior concerning synchronization, based on their state:

7 A processor in WAIT or EITH state can be waiting for at most two values. This is due to the reduction rules and is independent of the synchronization model.

• TRAV activities

- For each value in Msg-arrival, update Stack.
- Fetch the next instruction from the Control-cell's Pc.
- Execute the applicable transition when selected. When creating TRAV or SUSP activities, the creator cell's address has to be added to their Waited-by list. Adding (or deleting) an Activity-cell's address to the Stack (or to an Env-cell) involves updating the Waited-by list of the referenced cell.

• WAIT activities.
- Poll their own first locations of Msg-arrival (corresponding to polling's Son1-2). Upon arrival they update Stack and change their state to TRAV.

• COMP activities.
- If the final value is a basic value, forward it to the processors in Waited-by together with their own address. These processors can be Activity- or Env-cells. The "producer" address is used by the receiving processor to properly update the Stack or the list of values in Env-cells. After propagation of the basic value a COMP activity has an empty Waited-by and can collect itself.
- Else (if the final value is not a basic value but an activity's address) do nothing. COMP chains remain until a last-in-chain activity produces a basic value; then, one cycle at a time, the chain collapses.

• Env-cells.
They have a Msg-arrival pvar too. Based on received messages, Env-cells' processors scan their Values pvar, updating values.

Special monitoring is done at each MIMD simulation cycle on the graph's root Activity-cell. When it changes to the COMP state and its final value is a basic value, that value is the computation result.
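The producer/consumer handshake just described can be sketched as follows; the class and field names (`Cell`, `msg_arrival`, `waited_by`) mirror the text, but the code itself is our own hypothetical illustration, not the paper's implementation.

```python
class Cell:
    """A CM-cell under value propagation (our own encoding)."""
    def __init__(self, state):
        self.state = state
        self.stack = []
        self.msg_arrival = []   # (producer_address, value) pairs
        self.waited_by = []     # addresses of cells needing our value

def complete(cells, addr, value):
    """A completing activity forwards its basic value to every waiter,
    then its Waited-by is empty and it can collect itself."""
    me = cells[addr]
    me.state = "COMP"
    for waiter in me.waited_by:
        cells[waiter].msg_arrival.append((addr, value))
    me.waited_by = []

def deliver(cell):
    """A cell consumes arrived messages; a WAIT cell resumes as TRAV."""
    for _, value in cell.msg_arrival:
        cell.stack.insert(0, value)
    cell.msg_arrival = []
    if cell.state == "WAIT" and cell.stack:
        cell.state = "TRAV"

cells = {0: Cell("WAIT"), 1: Cell("TRAV")}
cells[1].waited_by = [0]      # cell 0 is waiting for cell 1's result
complete(cells, 1, 42)
deliver(cells[0])
assert cells[0].state == "TRAV" and cells[0].stack == [42]
```

The producer address carried in each message is what lets the receiver decide whether to update its Stack or an Env-cell entry.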

5.4. Polling versus value propagation

We evaluate these two synchronization approaches in terms of space and time overhead. With value propagation we have a space overhead due to the Msg-arrival and Waited-by pvars. The Msg-arrival length is limited by the number of Stack elements for Activity-cells and by the number of Values entries in Env-cells 8. The theoretical maximum length of Waited-by is the maximum number of activities. Based on simulation results, the maximum length of Msg-arrival has been lower than 10, while for Waited-by it has been lower than 100. Considering that each message takes 32 bits and each address 16 bits, there is enough local memory to store these structures.
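Translating the quoted figures into bytes (our own arithmetic, not from the paper):

```python
# Back-of-the-envelope check of the space bound quoted above:
# at most ~10 messages of 32 bits plus ~100 addresses of 16 bits per cell.
msg_bits = 10 * 32        # worst observed Msg-arrival length x message size
addr_bits = 100 * 16      # worst observed Waited-by length x address size
total_bytes = (msg_bits + addr_bits) // 8
assert total_bytes == 240  # a few hundred bytes per CM-cell
```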

Synchronization involves a strong time overhead. The most significant overhead is due to inter-processor communication. Based on benchmarks, only 20% of the polling operations are productive. Using the same benchmark programs, the number of unproductive remote accesses is compared with the time overhead involved by the maintenance of the Waited-by list in the value propagation approach. We will not report details on each transition evaluation in terms of remote accesses, but, as an example, a function application involves creating a new activity and transferring argument references from Stack to an Env-cell; following the value propagation approach, the processors assigned to the arguments' evaluation have to be notified to send their final value to the new location (Env), and this involves a number of remote accesses equal to the number of arguments. There is an extra time overhead in value propagation due to internal stack updating based on Msg-arrival values at each cycle. A local memory operation is very cheap compared to inter-processor operations; however, we should note that in the specific case of a SIMD machine, this involves a complete scan of the maximum-length stack for each message in the longest Msg-arrival list. Despite this high overhead, the number of remote accesses seems to be significantly reduced using value propagation.

8 This number, in turn, is limited by the maximum number of parameters in function calls multiplied by the program depth.

Considering the recovery of idle processors, value propagation has a "built-in" garbage collector for activities with an empty Waited-by list, while in the polling approach a mark & sweep algorithm would have to be implemented, involving extra time overhead.
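The "built-in" collection can be pictured as a reference count on Waited-by dropping to zero; a minimal Python sketch under our own cell encoding:

```python
def collect(cells):
    """Recycle COMP cells with an empty Waited-by list: their value has
    already been propagated, so no cell can ever read them again."""
    freed = [a for a, c in cells.items()
             if c["state"] == "COMP" and not c["waited_by"]]
    for a in freed:
        del cells[a]
    return freed

cells = {0: {"state": "COMP", "waited_by": []},     # fully propagated
         1: {"state": "COMP", "waited_by": [2]},    # still referenced
         2: {"state": "WAIT", "waited_by": []}}
assert collect(cells) == [0]
assert set(cells) == {1, 2}
```

Under polling, by contrast, no such local criterion exists, which is why a global mark & sweep pass would be needed.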

6. Conclusions

Research on possible implementations of concurrent functional languages on massively parallel architectures still represents a rich and productive research area. In this paper we presented Parlam, a functional language with primitives for parallel control, and a new abstract machine, ParSEC, derived from the SECD machine, for the execution of compiled Parlam programs. Based on this abstract machine, we discussed an implementation on the Connection Machine, considering synchronization problems in particular. We can emphasize a contribution to three different topics:
- execution of parallel functional programs in the SECD framework;
- MIMD machine simulation with very fine-granularity SIMD machines;
- the possibility of taking advantage of the CM's massive parallelism without worrying about the distribution of the algorithm among processors, and without using specific primitives for parallel operations on data (such as in * Lisp).


Appendix A

A.1 Parlam syntax

Syntax follows BNF formalism. Strings in bold are language reserved symbols, whereas those in angle brackets are obvious categories. Parentheses, brackets and commas are also reserved symbols.

Expr : Constant | Boolean | Ref | Conditional | Lambda-Expr | Application | Let | Letrec | Same | Init | Wait | Eith | Over | Arith-expr | Input | Output | Enclosed
Constant : <numeral> | '<string>'
Ref : <string>
Boolean : True | False
Conditional : If Test Then Thenexp Else Elseexp
Lambda-Expr : Lambda Parms . Expr
Parms : empty | Ref | Ref Parms
Application : Func (Args)
Args : empty | Expr | Expr , Args
Let : Let Bindings In Expr
Letrec : Letrec Bindings In Expr
Bindings : Ref = Expr | Ref = Expr Bindings
Init : Init Act In Body
Wait : Wait Act In Body
Eith : Either Exp1 Or Exp2
Same : Same Exp1 As Exp2
Over : Over Act
Arith-expr : LeftExp Arith-op RightExp
Input : Input Ref In Body
Output : Output Expr In Body
Enclosed : (Cont) | [Cont]
Act, Func, Test, Exp1, Exp2, Body, Thenexp, Elseexp, Cont : Expr
Arith-op : + | * | - | / | == | <
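To make the BNF concrete, here is a minimal recursive-descent recognizer for a tiny fragment of the grammar (Conditional, Boolean, Ref). This parser is our own sketch; the paper does not describe one.

```python
def tokenize(src):
    """Whitespace tokenizer, enough for the fragment handled below."""
    return src.split()

def parse_expr(toks, i=0):
    """Parse one Expr starting at toks[i]; return (AST, next index)."""
    t = toks[i]
    if t == "If":                               # Conditional
        test, i = parse_expr(toks, i + 1)
        assert toks[i] == "Then"
        then, i = parse_expr(toks, i + 1)
        assert toks[i] == "Else"
        els, i = parse_expr(toks, i + 1)
        return ("if", test, then, els), i
    if t in ("True", "False"):                  # Boolean
        return ("bool", t == "True"), i + 1
    return ("ref", t), i + 1                    # Ref: any other string

ast, _ = parse_expr(tokenize("If x Then True Else False"))
assert ast == ("if", ("ref", "x"), ("bool", True), ("bool", False))
```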

A.2 Semantics (ParSEC transitions)

We report transitions between successive configurations of the activities' graph according to the polling-based model. Recall that transitions can be applied concurrently. For each Parlam expression we report its compiled form in the intermediate language and the transitions involving each intermediate-language instruction introduced. For reasons of space some minor expression types are omitted.

Notation. A configuration is represented by a set of activities enclosed in parentheses. Each activity is named with the character A followed by a distinct index for each distinct activity. Its representation is enclosed within braces. The State, s(tack), e(nv), c(ontrol) and References registers are represented in this order. Their contents are often enclosed in parentheses with one or more items in front explicitly reported, while the rest is represented with the notation ".x" (where x will be, respectively, s, e or c). In the compiled form, an expression enclosed in square brackets stands for its compiled form, while the notation [exp]{x1...xn} stands for the compiled form of exp where each occurrence of the reference xk is compiled into "LD k".


Ref. string → LD m

(A1 = {...} ... Ak = {TRAV s (w1...Am...wp) (LD m . c) (...)} ...
 Am = {SUSP s' e' c' (...)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (Am . s) (w1...Am...wp) c (...)} ...
 Am = {TRAV s' e' c' (...)} ... An = {...})

(A1 = {...} ... Ak = {TRAV s (w1...wm...wp) (LD m . c) (...)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (wm . s) (w1...wm...wp) c (...)} ... An = {...})
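The two LD m transitions above can be sketched with activities as Python dicts; the encoding (tuples as addresses, 1-based Env indexing) is ours, not the paper's.

```python
def ld(acts, k, m):
    """Apply LD m to activity k: push the m-th Env entry; if it refers
    to a SUSP activity, wake that activity to TRAV (first rule above)."""
    a = acts[k]
    assert a["state"] == "TRAV" and a["ctrl"][0] == ("LD", m)
    w = a["env"][m - 1]
    if isinstance(w, tuple) and acts[w[1]]["state"] == "SUSP":
        acts[w[1]]["state"] = "TRAV"     # wake the suspended argument
    a["stack"].insert(0, w)              # push reference or value
    a["ctrl"] = a["ctrl"][1:]            # consume the instruction

acts = {
    "Ak": {"state": "TRAV", "stack": [], "env": [7, ("act", "Am")],
           "ctrl": [("LD", 2)]},
    "Am": {"state": "SUSP", "stack": [], "env": [], "ctrl": []},
}
ld(acts, "Ak", 2)
assert acts["Am"]["state"] == "TRAV"
assert acts["Ak"]["stack"] == [("act", "Am")] and acts["Ak"]["ctrl"] == []
```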

Conditional. If Test Then Thenexp Else Elseexp → [Test] SEL ([Thenexp] COMP) ([Elseexp] COMP)

(A1 = {...} ... Ak = {TRAV (True . s) e (SEL (th) (el) . c) (...)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (An+1 . s) e c (...)} ... An = {...}
 An+1 = {TRAV s e (th) (...)})

(A1 = {...} ... Ak = {TRAV (False . s) e (SEL (th) (el) . c) (...)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (An+1 . s) e c (...)} ... An = {...}
 An+1 = {TRAV s e (el) (...)})

(A1 = {...} ... Ak = {TRAV (Aj . s) e (SEL (th) (el) . c) (...)} ...
 Aj = {...} ... An = {...})
⇓
(A1 = {...} ... Ak = {WAIT (() . s) e c (Aj)} ... Aj = {...} ... An = {...})

(A1 = {...} ... Ak = {TRAV (Aj . s) e (COMP . c) (...)} ... An = {...})
⇓
(A1 = {...} ... Ak = {COMP () () () (Aj)} ... An = {...})

Lambda-Expr. Lambda x1...xn . Body → LDF ([Body]{x1...xn} COMP)

(A1 = {...} ... Ak = {TRAV s e (LDF [body] . c) (...)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV ([[body] . e] . s) e c (...)} ... An = {...})

Application. Func (arg1...argm) → MKPR ([arg1] COMP) ... MKPR ([argm] COMP) [Func] AP m

(A1 = {...} ... Ak = {TRAV s e (MKPR [code] . c) (...)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (An+1 . s) e c (...)} ... An = {...}
 An+1 = {SUSP () e code (...)})

(A1 = {...} ... Ak = {TRAV ([c' . e'] Ak1...Akm . s) e (AP m . c) (...)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (An+1 . s) e c (...)} ... An = {...}
 An+1 = {TRAV () (Ak1...Akm . e') c' (...)})

(A1 = {...} ... Ak = {TRAV (Aj Ak1...Akm . s) e (AP m . c) (...)} ...
 Aj = {...} ... An = {...})
⇓
(A1 = {...} ... Ak = {WAIT (() . s) e (AP m . c) (Aj)} ... Aj = {...} ... An = {...})

Let.

Let  x1 = exp1          MKPR ([exp1] COMP)
     x2 = exp2          MKPR ([exp2] COMP)
     ...            →   ...
     xm = expm          MKPR ([expm] COMP)
In Body                 LDF ([Body]{x1...xm} COMP) AP m
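The compilation scheme above treats Let as sugar for applying a Lambda to the bound expressions. A sketch of the same desugaring on a hypothetical tuple-based AST of our own (not the paper's intermediate language):

```python
def desugar_let(bindings, body):
    """Let x1=e1 ... xm=em In body  =>  (Lambda x1...xm. body)(e1...em)."""
    names = [x for x, _ in bindings]
    exprs = [e for _, e in bindings]
    return ("apply", ("lambda", names, body), exprs)

ast = desugar_let([("x", 1), ("y", 2)], ("add", "x", "y"))
assert ast == ("apply", ("lambda", ["x", "y"], ("add", "x", "y")), [1, 2])
```

This is why Let needs no transitions of its own: the Application rules above cover it.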

Same (similar to Lisp EQ). Same Exp1 As Exp2 → [Exp1] [Exp2] SAME

(A1 = {...} ... Ak = {TRAV (x x . s) e (SAME . c) (...)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (True . s) e c (...)} ... An = {...})

(A1 = {...} ... Ak = {TRAV (x y . s) e (SAME . c) (...)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (False . s) e c (...)} ... An = {...})

Init. Init Act In Body → [Act] CONT ([Body] COMP)

(A1 = {...} ... Ak = {TRAV (Aj . s) e (CONT Body . c) (...)} ...
 Aj = {SUSP ...} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (An+1 . s) e c (...)} ... Aj = {TRAV ...} ...
 An = {...} An+1 = {TRAV s e Body (...)})

Wait. Wait Act In Body → [Act] SEQ ([Body] COMP)

(A1 = {...} ... Ak = {TRAV (Aj . s) e (SEQ Body . c) (...)} ...
 Aj = {SUSP ...} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (An+1 . s) e c (...)} ... Aj = {TRAV ...} ...
 An = {...} An+1 = {WAIT s e Body (Aj)})

Either. Either Exp1 Or Exp2 → MKPR ([Exp1] COMP) MKPR ([Exp2] COMP) EITH

(A1 = {...} ... Ak = {TRAV (Aj Ai . s) e (EITH . c) (...)} ... Aj = {SUSP ...} ...
 Ai = {SUSP ...} ... An = {...})
⇓
(A1 = {...} ... Ak = {EITH (() . s) e c (Aj Ai)} ... Aj = {TRAV ...} ...
 Ai = {TRAV ...} ... An = {...})

Over. Over Act → [Act] OVER

(A1 = {...} ... Ak = {TRAV (Aj . s) e (OVER . c) (...)} ... Aj = {...} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (False . s) e c (...)} ... Aj = {...} ... An = {...})

(A1 = {...} ... Ak = {TRAV (v . s) e (OVER . c) (...)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (True . s) e c (...)} ... An = {...})

Waiting activities updating.

The following transitions are more general than required for Parlam primitives, since they consider m possible entries in Refs; in the case of Parlam, m ≤ 2.

(A1 = {...} ... Ak = {WAIT s e c (w1...Aj...wm)} ...
 Aj = {COMP () () () (x)} ... An = {...})
⇓
(A1 = {...} ... Ak = {WAIT s e c (w1...x...wm)} ...
 Aj = {COMP () () () (x)} ... An = {...})

(A1 = {...} ... Ak = {WAIT s e c (v1...vm)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (v1...vm . s) e c (...)} ... An = {...})

(A1 = {...} ... Ak = {EITH s e c (Ak1...Akj...Akm)} ...
 Akj = {COMP () () () (vkj)} ... An = {...})
⇓
(A1 = {...} ... Ak = {TRAV (vkj . s) e c (...)} ...
 Akj = {COMP () () () (vkj)} ... An = {...})

(A1 = {...} ... Ak = {EITH s e c (Ak1...Akj...Akm)} ...
 Akj = {COMP () () () (Aj)} ... Aj = {...} ... An = {...})
⇓
(A1 = {...} ... Ak = {EITH s e c (Ak1...Aj...Akm)} ...
 Akj = {COMP () () () (Aj)} ... Aj = {...} ... An = {...})

Environment updating.

(A1 = {...} ... Ak = {X s (Ak1...Akj...Akm) c (...)} ...
 Akj = {COMP () () () (y)} ... An = {...})
⇓
(A1 = {...} ... Ak = {X s (Ak1...y...Akm) c (...)} ...
 Akj = {COMP () () () (y)} ... An = {...})



References

[1] J. Backus, Can programming be liberated from the von Neumann style? A functional style and its algebra of programs, Commun. ACM 21 (8) (1978) 613-641.

[2] C. Bettini, La Connection Machine come architettura hardware per l'implementazione di un linguaggio applicativo parallelo, Master Thesis in Computer Science, University of Milano (1987).

[3] E. Decio, Sull'implementazione di un linguaggio applicativo concorrente, Master Thesis in Physics, University of Milano (1983).

[4] D. Friedman and D. Wise, An approach to fair applicative multiprogramming, in: G. Kahn and R. Hilner, eds., Lect. Notes in Comp. Science, Vol. 70 (Springer, Berlin, 1979) 203-225.

[5] P. Henderson, Functional Programming, Application and Implementation (Prentice-Hall, Englewood Cliffs, NJ, 1980).

[6] C. Hewitt and H.G. Baker, The incremental garbage collection of processes, ACM SIGPLAN Notices 12 (8) (1977) 55-59.

[7] W.D. Hillis, The Connection Machine (MIT Press, Cambridge, MA, 1985).

[8] H.T. Kung, The Structure of Parallel Algorithms, Advances in Computers, Vol. 19 (Academic Press, New York, 1980), 65-112.

[9] P.J. Landin, The mechanical evaluation of expressions, Comput. J. 6 (4) (1964) 308-320.

[10] P.J. Landin, The next 700 programming languages, Com- mun. ACM 9 (3) (1966) 157-166.

[11] G. Prini, Applicative parallelism, Draft, Stanford University (1980).

[12] G. Prini, Explicit parallelism in Lisp-like languages, in: Proc. of the 1980 Lisp Conference, Stanford (1980).

[13] Thinking Machines Corporation, * Lisp Reference Manual (TMC, Cambridge, MA, 1988).

Claudio Bettini received an M.S. in computer science from the University of Milano, Italy, in 1987. His thesis work was an investigation of computational models for a concurrent functional language suitable for implementation on SIMD architectures. In 1988-89 he was a PostDoc at the Scientific and Engineering Computations department of IBM-Kingston, NY. He is currently working in a research project on terminological languages at the Computer Science Department of the University of Milano.

Luca Spampinato holds a degree in Physics. He is a founder of Quinary, a high-tech Italian company devoted to knowledge-based system technology and system integration. His research interests include terminological and hybrid knowledge representation systems and DB-KB integration. He is the author of more than 20 international publications. In 1989 he was elected officer of the Italian Association for Artificial Intelligence (AI*IA).