
Future Generation Computer Systems 11 (1995) 617-629

Architectural support for inter-stream communication in an MSIMD system

Vivek Garg, David E. Schimmel *

Computer Systems Research Laboratory, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA

Abstract

This paper considers hardware support for the exploitation of control parallelism on data parallel architectures. It is well known that data parallel algorithms may also possess control parallel structure. However, the splitting of control leads to data dependency and synchronization issues that were implicitly handled in conventional SIMD architectures. These include synchronization of access to scalar and parallel variables, and synchronization for parallel communication operations. We propose a sharing mechanism for scalar variables and identify a strategy which allows synchronization of scalar variables between multiple streams. The techniques considered are based on a bit-interleaved register file structure which allows fast copy between register sets. Hardware cost estimates and timing analyses are provided, and comparison with an alternate scheme is presented. The register file structure has been designed and simulated for the HP 0.8 µm CMOS process, and circuit simulation indicates access times are less than six nanoseconds. In addition, the impact of this structure on system performance is also studied.

Keywords: Superscalar SIMD; MSIMD; Control parallelism; Synchronization; Interleaved Register File

1. Introduction

SIMD architectures are well suited for data parallel applications. It has also been shown that data parallel applications do exhibit a limited amount of control parallelism [1]. Some studies have approached this issue through mixed-mode programming [2,4,14,15]. Other work has considered the interpretation of MIMD programs [6,17]. Yet another approach for exploiting this control parallelism lies in extending the ability of SIMD to allow multiple concurrent streams. One such study has proposed an architecture termed superscalar SIMD [1,13], which specifies a small set of control processors with the ability to control arbitrary PE partitions. While a SIMD algorithm will sequentialize when executing a control parallel construct, i.e. an if-then-else on a parallel predicate, a superscalar SIMD will execute the control parallel branches of the code concurrently, which results in a number of data dependency issues. This paper investigates these data dependencies and hazards that arise as a consequence of superscalar SIMD. We then discuss the design of a register file structure that provides architectural support for interstream communication of scalar variables. Finally, we present analysis and simulation results followed by some conclusions.

* Corresponding author. Email: [email protected]



2. Superscalar SIMD architecture

Superscalar SIMD architecture has been described in [1,13]. A k-order superscalar SIMD machine has the ability to issue up to k simultaneous SIMD instructions. Each PE then selects one of these instructions to execute according to some local information. We focus on SIMD because one of the strongest arguments in support of SIMD is that making processors simpler implies higher levels of integration are possible [5]. Higher levels of integration, in turn, mean that for a given area of integrated circuit, the architecture can have significantly more processors available. Exploitation of the latent control parallelism provides a significant opportunity to improve the efficiency of such a system.

There are, of course, many candidate architectures which we might consider. However, they all must provide a certain basic functionality. We assume an architecture which consists of an array of PEs and a set of tightly coupled instruction issuing stream controllers (SC). The PEs have the added capability to execute one of a number of instruction streams according to local information. We place no constraints on the partition of processors executing a given instruction stream. In other words, each SC is capable of controlling an arbitrary sized partition of PEs. An SC is responsible for communication with other SCs when a new partition is formed or alternatively when two partitions are merged. The stream requesting a partitioning is considered the parent, while the sub-partition formed as a result is considered the child. This is a subtractive process as it removes the member PEs in the child partition from the parental partition. It is necessary to implement a barrier synchronization at the point where a child stream must re-merge with the parental stream. The mechanism to accomplish this is for the terminating SC to issue a join instruction which indicates to the parent SC readiness to rejoin the parent stream. In this paper, we refer to data which is defined on a stream controller as scalar data or variables, and data which is defined on the PE array as parallel data.

An SC is functionally similar to the array controller in a standard SIMD machine [9]. It broadcasts instructions and data to the PEs, and uses feedback from the PEs to control the behavior of the array. This functionality is limited to the partition of the PE array that the SC is controlling. As we have mentioned, an SC may also issue fork and join instructions to effect a partitioning or regrouping of the PEs in its control. To be able to fork and join efficiently, the SCs must communicate using a few control signals, and share scalar data using, for example, a common register file structure.

When a fork instruction is issued by an SC, first the parent SC acquires an idle SC and replicates the contents of its general purpose register file in that SC. It simultaneously transfers the starting instruction address to the child SC and they begin executing concurrently. Hardware support for fast initialization allows for single cycle dispatch of a child stream.

A join instruction is used by an SC to indicate stream completion. When both parent and child have registered a join, the PEs in the child stream are admitted back to the parent stream. We assume that sequential control semantics of typical SIMD programming models must be preserved. Hence access to shared variables in an else clause must resolve true (RAW) and anti (WAR) dependencies [10] with the corresponding then clause. To maintain this consistency, scalar variables modified in the register file of the child SC must be merged into the parent SC's register file. Again, the fast context switch hardware support allows this operation in a single cycle.
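As a minimal illustration of this copy-on-fork and merge-on-join behaviour, the following C sketch models one SC's scalar register file. The names (scalar_rf, sc_fork, sc_join) and the 32-register size are our own assumptions, and the per-register "dirty" flag stands in for the bookkeeping that Section 5 later provides in hardware.

```c
#include <stdint.h>
#include <string.h>

#define NREGS 32   /* assumed register count per SC */

/* Illustrative model of one stream controller's scalar register file. */
typedef struct {
    uint32_t reg[NREGS];
    uint8_t  dirty[NREGS];  /* set when this (child) stream writes the register */
} scalar_rf;

/* fork: replicate the parent's registers into an idle child SC
 * (modelled here as a memcpy; the hardware does this in one cycle). */
static void sc_fork(const scalar_rf *parent, scalar_rf *child)
{
    memcpy(child->reg, parent->reg, sizeof child->reg);
    memset(child->dirty, 0, sizeof child->dirty);
}

/* join: merge back only the registers the child modified, so the final
 * values match what sequential then/else execution would have produced. */
static void sc_join(scalar_rf *parent, const scalar_rf *child)
{
    for (int i = 0; i < NREGS; i++)
        if (child->dirty[i])
            parent->reg[i] = child->reg[i];
}
```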

3. Data dependencies in superscalar SIMD

Superscalar SIMD allows simultaneous execution of multiple instruction streams. Disallowing communication during the execution of concurrent streams, as in the Opsila machine [2], simplifies the architecture considerably. Allowing communication of scalar and parallel variables while concurrently executing instruction streams may give rise to data hazards. Therefore, we must develop mechanisms to allow scalar data to be shared between concurrent blocks, and to ensure correctness of the parallel data being accessed by the independently executing instruction streams.

[Fig. 1 appears here: (a) a control parallel code segment declaring scalar variables a, b, c, d, max_rows and plural variables e, f, g, h, row, with an if (row < max_rows/2) ... else ... construct whose branches form streams S1 and S2; (b) the resulting scalar WAR, parallel WAR, and parallel RAW hazards between the streams.]

Fig. 1. Data hazards caused by two communicating concurrent SIMD streams: (a) Control parallel code segment, (b) Scalar and parallel data hazards.

Concurrent execution of instruction streams implies that the strict instruction order that was implicit in the SIMD paradigm is only a partial ordering on a superscalar SIMD machine. As with any other architecture allowing out of order instruction issue, execution, or completion, there are three kinds of potential data hazards that must be avoided [8]. These are read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) hazards. Fig. 1 shows superscalar SIMD versions of these data hazards. Note that the hazards may arise in either scalar or parallel data. A RAW hazard occurs if the instruction in S2 is executed before the correct value of f is received from S1. If the instruction in S2 is executed before the fetch operation in S1 has received the value of e, then f gets computed incorrectly because of the WAR hazard. The WAW hazard can be detected and removed in the usual manner [8].

3.1. Scalar variables

In a conventional SIMD architecture, any writes to shared scalar variables in the then block are reflected in the else block naturally, due to the sequential nature of execution of parallel conditional constructs. In superscalar SIMD architecture we consider initiating a new stream, given that resources are available, to execute both the then and else blocks simultaneously. As a result, the data dependencies that are automatically preserved in SIMD architecture must be enforced in this execution model. Synchronization of scalar variables is necessary whenever scalar data computed in one concurrently executing block of code is referenced in another.

Consider an example, where a global OR function computed in a then block is referenced in the associated else block. In this case, some method of producer-consumer synchronization must be invoked. Because the result of the operation is deterministically available in the instruction issuing unit (SC), the synchronization may be effected through a shared lock mechanism between the peer instruction issuing units. These locks may be constructed from shared register space along with atomic synchronization operations. Note that because the proposed number of such instruction units is small [1], synchronization may be extremely fast and efficient as opposed to general synchronization between many MIMD processors.

We identify a scalar variable as a shared scalar variable if it is accessed in two or more concurrent instruction streams. To illustrate the occurrence of data hazards in the superscalar SIMD architecture, let us assume that we have two concurrently executing SIMD instruction streams, S1 and S2, which have data dependencies between them. Let x be the scalar variable shared by the two streams.


If x is only read in both streams, then only a RAR dependency is present, which does not result in any data hazards. However, it does require the sharing of data between the two streams. If x is written in S1 and read in S2, a RAW data dependency arises. With a standard SIMD architecture, this dependency does not cause a hazard, due to the sequential execution of parallel conditional constructs. In superscalar SIMD multiple streams may execute in parallel, hence producer-consumer synchronization is required to ensure that the correct value of x will be read by S2.

If x is read in S1 and written in S2, a WAR data dependency is present. A mechanism is required to ensure that S2 does not overwrite the previous value of x until S1 has read it. If x is written by both streams, it results in a WAW hazard. In both the WAW and WAR dependencies, S2 writes x, hence we should ensure that the final state of x is consistent with its traditional SIMD counterpart. This implies that the values of x in both streams should be synchronized, and furthermore, the final value of x should be the value assigned to it in S2. This requires a mechanism to guarantee that when the two streams merge, the shared scalar x has the correct value.
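These four cases can be summarized in a small decision helper. The sketch below is purely illustrative (classify and the stream flags are hypothetical names, not part of the proposed hardware); it simply maps which streams write the shared scalar x to the dependency class discussed above.

```c
#include <stdio.h>

typedef enum { RAR, RAW, WAR, WAW } scalar_dep;

/* s1_writes: x is written in S1 (the then stream)
 * s2_writes: x is written in S2 (the else stream)
 * A stream that does not write x is assumed to read it. */
static scalar_dep classify(int s1_writes, int s2_writes)
{
    if (s1_writes && s2_writes) return WAW;  /* both write: the S2 value must win        */
    if (s1_writes)              return RAW;  /* produced in S1, consumed in S2           */
    if (s2_writes)              return WAR;  /* S2 must not overwrite x before S1 reads  */
    return RAR;                              /* read-only sharing, no hazard             */
}

int main(void)
{
    const char *name[] = { "RAR", "RAW", "WAR", "WAW" };
    printf("x written in S1 only -> %s\n", name[classify(1, 0)]);
    printf("x written in S2 only -> %s\n", name[classify(0, 1)]);
    return 0;
}
```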

3.2. Parallel variables

Since sharing of parallel variables is achieved via the interconnection network and communication primitives, any data communication which requires crossing the boundary of the partition it is executing within requires explicit synchronization to avoid potential data hazards. For instance, consider the superscalar SIMD if-then-else construct shown in Fig. 2. Here, both the then and else blocks are executing simultaneously, which may result in data hazards. From a SIMD point of view it is possible for the then stream to issue a send on parallel data that may be referenced in the else stream, or alternatively processors executing the else stream may fetch parallel data produced by processors executing the then block. Both of these operations need synchronization to obtain correct results. A fetch on parallel data issued in the then block, and a send on parallel data issued in the else block, do not interact with their counterparts because of the implicit sequential ordering enforced by the semantics of the SIMD model. These operations do not cause hazards if all synchronization is complete upon reaching endif.

[Fig. 2 appears here: an if f(x) then S1 / else S2 construct, with arrows marking the possible send and fetch operations between the two concurrent blocks.]

Fig. 2. Representation of all possible communication between concurrent if-then and else blocks.

Again, consider when parallel data needed in the else stream is generated in the then stream. This creates a RAW hazard if the communication and reference are executed out of order. A WAR hazard may occur when processors executing the then stream require data from PEs executing in the else partition. It should be noted that this data dependency is only observed when conditional constructs are enclosed in a loop-like structure, since in SIMD programs any data dependency between the then and else blocks has to be one way, from the then to the else block. Therefore, a data dependency from the else stream to the then stream can only exist between two separate iterations of the loop. In a SIMD machine this data would not be affected by the else stream until after the then stream has completed executing. In superscalar SIMD, the concurrently executing else stream may overwrite the data required by the then stream before it is actually referenced, creating a WAR hazard.

WAR hazards may always be avoided by using techniques such as aliasing data or reallocating memory. The RAW hazards cannot be avoided without explicit synchronization, since they require the presence of data which has not been computed yet.
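A rough sketch of the distinction, under assumed names and sizes (renamed buffers for the WAR case, and a ready flag standing in for whatever synchronization primitive the machine actually provides for the RAW case):

```c
#include <stdbool.h>

#define NPE 1024   /* assumed partition size; illustration only */

/* WAR on parallel data: alias (rename) the variable so the else partition
 * writes a fresh copy while the then partition still reads the old one.
 * No synchronization is needed, only extra storage. */
static float e_old[NPE];   /* version read by the then stream    */
static float e_new[NPE];   /* version written by the else stream */

/* RAW on parallel data cannot be removed this way: the consumer must wait
 * until the producing stream has actually sent the data.  A ready flag
 * stands in here for an explicit synchronization mechanism. */
static volatile bool f_sent = false;

static void then_stream_sends_f(void)    { /* ...inter-PE send of f... */ f_sent = true; }
static bool else_stream_may_read_f(void) { return f_sent; }
```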


4. Specialized support for inter-stream communication of scalar data

Our objective is to design a structure to support fast context switching between streams. The context as referred to in this paper represents the scalar data that is shared between two or more concurrently executing SIMD code segments. To ensure that this sharing of data preserves all dependencies and avoids data hazards, the structure must have mechanisms for producer-consumer synchronization and selective merging between streams. Finally, to achieve fast context switch, the structure should be able to accomplish single cycle copy of one or more registers between streams.

There are potentially a number of designs that satisfy the above criteria; we will discuss two alternatives in this paper. The first approach consists of allocating physically separate register files for each of the streams, which communicate via a shared bus. The bus is word wide and has the ability to operate in broadcast mode. The basic cell of the register files is based on an SRAM cell. We will refer to this design as the shared bus scheme.

The second approach consists of a bit-interleaved register file design, also referred to as the distributed bus scheme. Each register file may be addressed independently, and can be assigned to any one stream. The basic cell in the register file consists of 4 SRAM cells and a local copy bus. To perform a copy operation, the SRAM cell in the source register file writes the local copy bus, while the one in the destination register file reads the local bus and stores the bit. Since the entire register file structure is made up of these basic cells, it is possible to replicate the entire contents of one register file into another in a single cycle.

Both of these techniques require additional status bits to provide producer-consumer synchronization and selective merging. The cost of the copy operation is highly dependent on the number of registers copied and the type of scheme used. Of course, the distributed bus register file has the clear advantage when copying, due to the limited bandwidth available in the bus based design, but it also increases the cost of hardware resources.

5. Distributed bus register file

We propose a structure consisting of k bit-interleaved register files. Each bit of the register file has two read ports and one write port. It also has local read and write ports to facilitate single cycle bit copy, consequently allowing single cycle copy of the entire register file. Each register has two corresponding status bits, busy (B) and dirty (D). The busy bit indicates whether the value held by a particular register is in a valid state. If the busy bit is set in the child stream, the register should not be read or written by the child stream until the busy bit is cleared. The dirty bit is used to indicate that the child (else) stream has modified the corresponding register value. If the dirty bit is set in the child stream, the register value should be merged back into the register file of the parent stream, so that the data values are consistent with those in a conventional SIMD execution model. If the dirty bit is clear, no action is required since the shared scalar was only read by the child stream.

Fig. 3. Block diagram of the register file and other components interacting with it.

Fig. 3 shows a block diagram of the register file and other entities with respect to it. The following control signals are available for accessing the register files: ext_rd1, ext_rd2, int_rd, ext_wr, and int_wr. These control signals correspond to activity at the different ports in the register file. There are four types of operations that are required to use the distributed bus register file for scalar variable sharing. The first operation, set_busy, requires a register file and a register address, and is used to set the busy bit of the addressed register. The second operation, rf_copy, requires the parent and child register file addresses, and copies all the registers from the parent register file to the child. In addition to copying all the data, it also copies the busy bits between the parent and child register files, and it clears the dirty bits in the child register file. The third operation, r_copy, similar to the rf_copy operation, requires the parent and child register file addresses and a register address. It copies the specified register from the parent to the child, and clears the busy bit in both the parent and child registers. The fourth operation, merge, also requires the parent and child register file addresses. It copies all registers in the child register file with their dirty bits set to the parent register file.
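The following C model is one way to make the stated behaviour of these four operations concrete. The operation names (set_busy, rf_copy, r_copy, merge) follow the text; the data layout, sizes, and everything else are assumptions for illustration and are not the VLSI implementation.

```c
#include <stdint.h>

#define K      4    /* number of register files (one per SC) */
#define NREGS  32   /* registers per file, assumed           */

typedef struct {
    uint32_t val[NREGS];
    uint8_t  busy[NREGS];   /* B: value not yet valid in this stream        */
    uint8_t  dirty[NREGS];  /* D: register was written by the child stream  */
} regfile;

static regfile rf[K];

/* set_busy: mark one register of one file as not yet valid. */
static void set_busy(int file, int r) { rf[file].busy[r] = 1; }

/* rf_copy: parent -> child, all registers; busy bits are copied along,
 * dirty bits in the child are cleared (single cycle in the hardware). */
static void rf_copy(int parent, int child)
{
    for (int r = 0; r < NREGS; r++) {
        rf[child].val[r]   = rf[parent].val[r];
        rf[child].busy[r]  = rf[parent].busy[r];
        rf[child].dirty[r] = 0;
    }
}

/* r_copy: copy one register parent -> child and clear busy in both,
 * releasing a consumer that was blocked on that register. */
static void r_copy(int parent, int child, int r)
{
    rf[child].val[r]   = rf[parent].val[r];
    rf[parent].busy[r] = 0;
    rf[child].busy[r]  = 0;
}

/* merge: at the join, copy back every child register marked dirty. */
static void merge(int parent, int child)
{
    for (int r = 0; r < NREGS; r++)
        if (rf[child].dirty[r])
            rf[parent].val[r] = rf[child].val[r];
}
```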

5.1. Application to super-scalar SIMD

The distributed bus register file structure is used to remove data hazards and to provide synchronization for scalar variables between concurrently executing SIMD streams. A few cycles prior to the creation of the else stream, provided a stream controller is available, the entire parent register file is copied into the child register file. Consequently the then and else blocks will execute in parallel, accessing the scalar data from their local register files. At the end of the execution of the two streams, i.e. at the endif, the two streams are merged together while selectively copying all registers that were touched in the else stream back into the then stream. In the following discussion we will refer to the ith register in the parent and child register files as R_i^P and R_i^C respectively.

Standard dependency analysis in the compiler can be used to set the busy bits for registers in the parent register file that will require synchronization to avoid RAW hazards between the then and else streams. At the time of the register file copy, the busy bits are transferred to the child register file as well. Until the busy bits are cleared in the child register file, the registers corresponding to the busy bits that are set should not be read or written by the child stream. The parent stream can write its registers regardless of the status of its busy bits. An r_copy operation on the registers blocked in the child stream is performed when valid data in the parent register file is available. Following this strategy, the busy bit can be used to implement producer-consumer synchronization to avoid scalar data hazards between the then and else streams. Fig. 4 shows a RAW data dependency between two concurrently executing instruction streams. The scalar variable b is shared between the two streams. At the time when S2 is created from S1 and placed on a new SC, the register file of the parent stream is copied into the child's register file. Since the variable b resident in R5 cannot be read by S2 until the add operation has been completed in S1, the busy bit in the child register file is set to indicate an invalid state (Fig. 4(b)). After the add instruction in S1, when the value of b in R5 is updated, an r_copy operation is performed on R5 and the busy bits corresponding to the parent and child registers are cleared (Fig. 4(c)), allowing the add instruction in S2 to execute.
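Using the illustrative model sketched after Section 5, the Fig. 4 scenario can be expressed roughly as follows; the register number (R5) comes from the example, while the helper names and the polling test are hypothetical stand-ins for however the child SC actually stalls.

```c
/* Builds on the regfile model above: parent = S1 in file 0, child = S2 in
 * file 1; the scalar b lives in register 5. */
enum { PARENT = 0, CHILD = 1, R5 = 5 };

static void fork_with_raw_on_b(void)
{
    set_busy(PARENT, R5);        /* compiler marked b as not yet produced  */
    rf_copy(PARENT, CHILD);      /* busy bit travels with the copy         */
}

static void parent_produces_b(uint32_t b)
{
    rf[PARENT].val[R5] = b;      /* add completes in S1                    */
    r_copy(PARENT, CHILD, R5);   /* forward b and clear busy in both files */
}

static int child_may_read_b(void)
{
    return !rf[CHILD].busy[R5];  /* S2 stalls while the busy bit is set    */
}
```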

WAR hazards are treated somewhat differently than RAW hazards using this register file structure. After the creation of the child stream, the child register file has a copy of all variables it shares with its parent. However, in this case no producer-consumer synchronization is required, hence the busy bits are not relevant. Anytime a scalar is modified in the child stream, the dirty bit corresponding to the variable is set in the child register file. This enables the two streams to independently manipulate the shared variable, and the WAR hazard is effectively eliminated via register copying [8]. At the join, the register files are merged together and any registers with the dirty bit set in the child register file are copied back into the parent register file. This maintains consistency with the sequential execution of parallel control constructs in standard SIMD.

Fig. 4. Resolving a RAW scalar data dependency between two concurrently executing SIMD streams. (The first operand in the instruction format is the destination address.)

Fig. 5 shows an example of a WAR dependency between two concurrently executing instruction streams. Shared variables b and c are copied from the parent register file to the child's register file at fork time (Fig. 5(b)). No flags are set at fork time, since we are only concerned with the WAR dependency in this example. Both streams are able to access the shared variables independently. The add instruction in S1 reads the unmodified value of b from its register file, while the add instruction in S2 can overwrite the value of b in its register file. After the add instruction in S2, the dirty bit is set in the child's R3 to indicate that the value of b in S2 differs from that in S1 (Fig. 5(c)). At the end of the streams, a merge operation is performed to join the two streams. At this point R3, the dirty register in the child register file, is copied into the parent register file as shown in Fig. 5(d). R4 does not require any merging, since there was no inconsistency between R4 in the parent and child register files, as indicated by the clear dirty bit in the child register file. Furthermore, any register in the child register file which remained unmodified should not be merged with the parent register file, as the value of those registers could have been modified in the parent. Therefore, it is important that only the registers marked dirty in the child register file are copied back into the parent.

Fig. 5. Resolving a WAR scalar data dependency between two concurrently executing SIMD streams.

WAW hazards are dealt with in a manner similar to the WAR hazard. The shared variables are copied from the parent to the child register file at fork time. Once again none of the status bits are manipulated at the time of the fork. This enables both of the instruction streams to independently write the value of the shared variable regardless of the order. When the value of the shared variable is modified in the child stream, the dirty bit corresponding to the register is set in the child register file. It is important to note that modifications in the parent stream do not affect its dirty bit. When the two streams join, the dirty registers in the child register file are copied into the parent register file.
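Continuing the same illustrative model, WAR and WAW sharing involves no blocking at all: the child marks its writes dirty and merge applies them at the join, so the final value is the one assigned in the else stream, as sequential SIMD semantics require. A minimal sketch:

```c
/* WAR/WAW handling on the same illustrative model: no stalls, only
 * copy-on-fork, a dirty bit on child writes, and merge at the join. */
static void child_writes(int r, uint32_t v)
{
    rf[CHILD].val[r]   = v;
    rf[CHILD].dirty[r] = 1;      /* remember that S2 redefined this scalar */
}

static void join_streams(void)
{
    merge(PARENT, CHILD);        /* only dirty registers flow back, so the
                                    final value is the one assigned in S2  */
}
```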

Fig. 6 shows a WAW hazard between two concurrent SIMD streams. As shown in Fig. 6(b), the shared variable is copied into R5 in the child register file at the fork. At this time both S1 and S2 can write to R5 independently. As mentioned earlier and as illustrated in Fig. 6(c), the dirty bit is set in the child register file after the add in S2 is executed. When the two streams join, R5 is copied from the child to the parent register file (Fig. 6(d)).

Therefore, we have the ability to resolve all RAW, WAR, and WAW data hazards between two or more concurrently executing SIMD instruction streams by utilizing this distributed bus register file structure and two status bits per register. We point out that the RAW hazard is the only non-removable data hazard present when using this scheme. The WAR and WAW dependencies would be true hazards if both streams were accessing their variables from the same physical location. The duplication of register spaces effectively removes these data hazards. The synchronization between the child and parent streams addresses the issue of preserving the sequential consistency of the SIMD execution model in superscalar SIMD. In a SIMD machine, if a variable is modified in the else stream, then at the end of the parallel conditional construct, the value of the variable is the same as was assigned in the else block. This semantic is enforced through the synchronization of the two streams, which results in updating all scalar variables in the parent stream that were modified in the child stream.

[Fig. 6 panels: (a) concurrent streams in SIMD; (b) parent and child register file status after the fork process; (c) status after R5 has been generated in S2; (d) status after the join process.]

Fig. 6. Resolving a WAW scalar data dependency between two concurrently executing SIMD streams.


6. Shared bus register file

This solution is similar to the distributed bus register file in the sense that it is constructed by applying the basic block used to build the distributed bus register file at a macro level. The local copy bus in the distributed bus structure is replaced by a global copy bus, which does not alter the functionality, but does change the resulting performance.

6.1. Description of the shared bus register file

The design consists of individual register files with the desired number of registers, each bound to an SC. Any communication between the register files takes place over a broadcast bus which is shared by all register files. The register files have three ports, two for reading and one for writing. The register files interface with the broadcast bus via their read and write ports and some interface logic. A block diagram of the floorplan for the shared bus register file scheme is shown in Fig. 7. The bus width is a factor that has a direct impact on the effectiveness of the structure. We specify the shared data bus to be 32 bits wide, which matches the word size of the register file and provides the ability to copy one word per cycle between any two register files. Each register also has two status bits: busy and dirty.

Fig. 7. Block diagram of the floorplan for the shared bus register file scheme.

6.2. Application to super-scalar SIMD

In this section we consider the relative performance of the shared bus register scheme and the distributed bus register file. At one extreme, all the registers in the register file can be obliviously copied from the parent to the child at the time of the fork. This has a significant impact on performance, since each register copy operation takes one cycle. At the other extreme, if only a small subset of registers is required to be copied between the streams, then copying the entire register space is unnecessary. Selective copying of registers between streams, however, requires generating the appropriate code to accomplish this. We may use a demand-driven model, where operands are requested from the parent SC using explicit requests. On the other hand, the compiler can identify the scalar variables shared between the parent and child streams, and issue instructions to copy only the required variables. This software overhead may be substantial enough to nullify the savings achieved via selective copy. The busy and dirty status bits available in each register can be used in a manner similar to the distributed bus scheme to resolve data hazards and provide synchronization.

7. Comparative analysis

7.1. Register access times

Since the distributed bus register file structure employs a single cycle copy, the entire contents of the parent register file are made available to the child via the single cycle copy mechanism. The shared bus register file structure will likely rely on the compiler to copy the appropriate registers from the parent to the child. If the width of the shared data bus is set to a single word, every register copy will require one cycle. This implies that if there were n variables to be copied, the shared bus register file scheme would require n - 1 additional cycles for the fork as compared to the distributed bus scheme. Similarly, both schemes use the dirty status bit for data synchronization at the joins, but the separate register space scheme requires extra cycles whenever there is more than one register which is merged back into the parent register file.

Let ‘fork represent the number of fork instruc- tions in a superscalar SIMD program. The num- ber of join instructions Ijoin must be equal to the number of fork instructions. It may often be necessary to copy the entire register file, since multiway branches are implemented as nested binary branches, and a child process may have to supply a subset of scalar variables on behalf of its child. Hence, we will require that the entire scalar register space be copied from the parent to the child stream. Assuming n registers in the SC register file, the shared bus register file scheme takes n cycles to perform the fork, while the distributed bus register file scheme requires only One cycle. If we posit that on average half of the registers in the child stream must be merged into the parent stream, the shared bus register file scheme requires n/2 cycles to perform a join, while the distributed bus register file scheme again uses a single cycle. In general, the shared bus register file consumes an extra ((3n - 4)/2)

’ Ifork machine cycles in execution time con- trasted with the distributed bus register file scheme. Since n is typically fixed and is a rela- tively small number, the impact of this overhead is a consequence of the number of fork instruc- tions in the program, i.e. the degree of control parallelism in the application.
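The overhead expression can be checked with a one-line helper; the example value assumes n = 32 registers, which matches the register count used later in the paper but is otherwise arbitrary here.

```c
#include <stdio.h>

/* Extra cycles of the shared bus scheme relative to the distributed bus
 * scheme: (n - 1) per fork plus (n/2 - 1) per join, i.e. (3n - 4)/2 per
 * fork/join pair, times the number of forks I_fork. */
static long shared_bus_overhead(long n, long i_fork)
{
    return ((3 * n - 4) / 2) * i_fork;
}

int main(void)
{
    /* e.g. n = 32 registers: 46 extra cycles per fork/join pair */
    printf("%ld\n", shared_bus_overhead(32, 1));
    return 0;
}
```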

We have acquired experimental data from the execution trace of a data parallel LDL^T algorithm [7] running on a 1024 processor MP-2 [3,12]. We use the data to derive an estimate of the additional cycles used by the shared bus approach. The algorithm has three fork and join instructions per loop iteration. The three fork instructions have 8, 6, and 6 scalar variables that they share with the parent stream. We observe that the second fork actually shares only 4 scalar variables with the parent stream, but its child stream shares 6 scalar variables with the original parent. As a result, the second fork must copy 6 instead of 4 registers. The join instructions require merging of at most one register for any of the joins, resulting in the same performance as the distributed bus register file. For this algorithm, choosing the distributed bus register file over the shared bus register file scheme decreases the execution time by 17 cycles per loop iteration. Since there are p loop iterations in the algorithm for a p × p problem, the total time advantage is 17p cycles. The loop execution time per iteration is approximately 8500 cycles. The performance improvement achieved by the use of the distributed bus scheme is then 8500p/((8500 - 17)p) ≈ 1.002. However, we have assumed a priori that the number of scalars shared between streams was known, which may not be the case. Let us repeat the calculation, assuming all 32 registers are copied at fork time, and 16 registers require merging at the join. For the three fork instructions in our example we gain an advantage of 93 cycles per iteration, and the join instructions add another 45 cycles per iteration, yielding a total gain per loop iteration of 138 cycles. This results in a performance improvement of 8500p/((8500 - 138)p) ≈ 1.02, or a 2% improvement. The performance differential between the two schemes may play a significant role in the aggregate performance of programs where many short streams execute concurrently. In these codes the high synchronization cost imposed by the shared bus register file scheme can degrade system performance considerably.
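The arithmetic behind these two figures can be reproduced directly; the sketch below simply re-evaluates the cycle counts and speedups quoted above, under the stated assumption of an approximately 8500-cycle loop body.

```c
#include <stdio.h>

int main(void)
{
    const double loop_cycles = 8500.0;   /* per LDL^T loop iteration (from the trace) */

    /* Selective copy: the forks move 8, 6 and 6 registers, each saving
     * r - 1 cycles over the one-word-per-cycle shared bus; the joins merge
     * at most one register, so they save nothing. */
    double selective = 7 + 5 + 5;                      /* 17 cycles per iteration       */

    /* Oblivious copy: all 32 registers at each of 3 forks, 16 merged at each join. */
    double oblivious = 3 * (32 - 1) + 3 * (16 - 1);    /* 93 + 45 = 138 cycles          */

    printf("selective: save %.0f cycles, speedup %.3f\n",
           selective, loop_cycles / (loop_cycles - selective));  /* ~1.002               */
    printf("oblivious: save %.0f cycles, speedup %.3f\n",
           oblivious,  loop_cycles / (loop_cycles - oblivious)); /* ~1.017, quoted as 1.02 */
    return 0;
}
```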

7.2. VLSI design issues

The VLSI layout of the register files discussed in this paper is based on SRAM cells. The registers require two read ports and a write port. In addition to these ports, the distributed bus register file requires two extra ports per bit for the local copy mechanism. In the interest of saving area in the layout we implement the SRAM cells with only one of the bit and bit-bar data lines [16]. We will use the VLSI layouts to estimate the silicon area required for both schemes.


7.2.1. Chip area

While functionally the distributed bus register file and the shared bus register file are similar, the bit-interleaving and the local copy bus required by the distributed bus register file structure result in an increased size of the basic cell. Let A_SRAM represent the area, in units of λ², for a three-ported (two read and one write) SRAM cell. A_RD and A_WD represent the area for read and write drivers respectively, and A_Decoder is the area required for the logic for one 5:32 decoder. Also let A_Wiring be the area required for running control signals and power distribution buses in the basic cells, and A_Routing and A_Status be the area for routing and the status bits respectively. In addition, let l_SRAM and h_SRAM be the length and height of an SRAM cell, and l_DB and h_DB be the length and height of a distributed bus register file basic cell. For generality, we will allow m, n, and k to represent the word size, register count, and number of register files respectively. We assume that the area requirements for the routing, status bits, and read and write drivers are the same for either scheme, so we consider them jointly as A_Misc = A_RD + A_WD + A_Routing + A_Status.

The decoding logic required for the distributed bus register file is slightly larger than for the shared bus register file, since there are more control signals required for its registers. The shared bus register file structure only uses three 5:32 decoders for each register file, as opposed to five 5:32 decoders for each register file in the distributed bus register file scheme. Likewise the area required for wiring the control lines in the basic cells, A_Wiring, is slightly larger in the distributed bus scheme than in the shared bus scheme.

The basic cell of the distributed bus register file structure consists of four SRAM cells, each with 2 read ports, a write port, a pair of local read and write ports, and a single-bit copy bus. The dimensions of the basic cell layout were 162λ × 78λ excluding the control and power distribution buses, which is approximately 4.7 times the area of a basic 3-ported SRAM cell, or 1.17 times the area of four 3-ported SRAM cells. The cost function for the silicon area used by the distributed bus register file is given as

A_DB = m n δ_A A_SRAM + 5k A_Decoder + (5k + 2) n l_DB λ_metal1 + A_Misc.   (1)

Based on our layout of the distributed bus register file structure, the term

δ_A = (l_DB h_DB) / (l_SRAM h_SRAM) = 4.77   (2)

for k = 4. The basic cell for the shared bus register file scheme is simply a three-ported SRAM cell with 2 read ports and a write port. Each basic cell also includes the estimated area required to wire the read and write ports to the broadcast bus. The dimensions of the three-ported SRAM cell designed for the shared bus register file scheme are 62λ × 43λ excluding the control and power distribution buses. The cost function for the silicon area used by the shared bus register file is given as

A_SBS = m n k A_SRAM + 3k A_Decoder + 5 k n l_SRAM λ_metal1 + A_Misc.   (3)
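The two cost models can be evaluated numerically. The sketch below plugs in the reported cell dimensions (162λ × 78λ and 62λ × 43λ) with m = 32, n = 32, k = 4, and leaves the decoder, wiring-pitch, and miscellaneous terms as parameters (set to zero here to isolate the storage arrays), because the paper does not give their values; note that Eq. (1) is used in the reconstructed m·n·δ_A form.

```c
#include <stdio.h>

/* Area models from Eqs. (1)-(3), in units of lambda^2.  delta_a is the ratio
 * of a distributed bus basic cell (4 bits) to one 3-ported SRAM cell. */
static double area_distributed(double m, double n, double k,
                               double a_sram, double delta_a, double l_db,
                               double a_decoder, double lambda_m1, double a_misc)
{
    return m * n * delta_a * a_sram                    /* bit-interleaved basic cells */
         + 5.0 * k * a_decoder                         /* five 5:32 decoders per file */
         + (5.0 * k + 2.0) * n * l_db * lambda_m1      /* control/power wiring        */
         + a_misc;
}

static double area_shared(double m, double n, double k,
                          double a_sram, double l_sram,
                          double a_decoder, double lambda_m1, double a_misc)
{
    return m * n * k * a_sram                          /* one 3-ported cell per bit   */
         + 3.0 * k * a_decoder                         /* three 5:32 decoders per file*/
         + 5.0 * k * n * l_sram * lambda_m1
         + a_misc;
}

int main(void)
{
    double a_sram  = 62.0 * 43.0;                      /* 62λ x 43λ SRAM cell          */
    double delta_a = (162.0 * 78.0) / a_sram;          /* ~4.74; the paper quotes 4.77 */

    double db  = area_distributed(32, 32, 4, a_sram, delta_a, 162.0, 0, 0, 0);
    double sbs = area_shared     (32, 32, 4, a_sram,  62.0,          0, 0, 0);

    printf("storage-array ratio DB/SBS = %.2f\n", db / sbs);   /* ~1.18 (the ~18%% penalty) */
    return 0;
}
```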

7.2.2. Scaling

As the value of k, the maximum allowable number of concurrent SIMD streams, increases, the register file structures must scale accordingly. Scaling the shared bus register file structure entails adding more register files to the bus. The basic cell of the design remains the same, consequently it is relatively easy to adjust the size of the structure. As k changes, alterations to the distributed bus register file structure require a re-design of the basic cell, since it is bit-interleaved. One can envision a hierarchical hybrid of the above techniques to obtain larger structures without major modifications. For example, if we want to increase k from 4 to 8, we can take two distributed bus register file structures designed for k = 4 and connect them with a word-wide bus. This would help eliminate redesigning of the VLSI layout at the cost of slower interstream communication between certain SCs.

8. Results

The optimal value of k, the number of SCs, is an open question. Experiments run on various SIMD applications suggest that k = 4 may be sufficient [1]. The application set consisted of SIMD implementations of neural network simulation, Cholesky factorization, Life, a 2-D median filter, and an M/M/1 queue simulation.


Fig. 8. Bit-interleaved cell consisting of four 1-bit 5-ported SRAM cells with support for local copy.

The distributed bus register file structure discussed in Section 5 has been designed at the custom VLSI level. The structure is designed for k = 4, i.e. up to four SIMD instruction streams can be executed concurrently. The basis for the structure is a bit-interleaved cell which consists of 4 SRAM cells with 5 ports each, as illustrated in Fig. 8. Each SRAM cell belongs to a separate logical register file, and all the SRAM cells in the basic cell are connected together via a common bus. Each of the SRAM cells has 2 write ports and 3 read ports. Of these, a pair of write and read ports is dedicated to copying data to and from the common bus. This common bus provides the ability to perform a bit copy between any two SRAM cells in the basic cell in one cycle. Since this basic cell is replicated to create the register file structure, the common bus makes it feasible to copy an entire logical register file to another in one cycle.

The extra resources required by the distributed bus structure over the shared bus structure are quite modest. The shared bus scheme is based on the three-ported SRAM cell which has 2 read ports and a write port, whereas the distributed bus basic cell requires two extra ports for implementing the local copy mechanism, and the bus. The distributed bus cell layout is 162λ × 78λ, while the basic cell for the shared bus register file is 62λ × 43λ. The distributed bus basic cell has 4 times the storage capacity and requires 4.7 times the area, yielding an 18% area increase per 4-bit cell. The height of the distributed bus structure is also larger than its counterpart by 2n λ_metal1, where n represents the number of registers in the register file, and λ_metal1 is the minimum pitch required for metal1 buses. Amortized over the number of register files, the increase in height per register file is (n/2) λ_metal1.

The distributed bus basic cell with bit-interleaved SRAM cells was designed and simulated. The structure was laid out using custom VLSI tools in the MOSIS 0.8 micron scalable CMOS technology, and the design was extracted and simulated using HSPICE. The simulation results indicate a local copy time, i.e. an SRAM cell writes the copy bus while another cell reads the copy bus, of under 3 ns. Read times for the bit-interleaved cell are under 6 ns, while write times are under 2 ns. It should be noted that the read and write times presented include the delay incurred from the read and write drivers.

9. Conclusions

Superscalar SIMD architectures allow concurrent execution of multiple SIMD streams. We have examined the data dependencies in scalar and parallel data structures that result from the superscalar execution model. We investigate the conditions under which these SIMD data dependencies cause data hazards in superscalar SIMD. We have proposed a register file structure that provides support for communicating scalar data between concurrently executing SIMD streams. The structure was designed to support the superscalar SIMD architecture, and enables fast stream creation and synchronization. A bus based alternative for sharing scalar data was also analyzed. Analysis of the hardware resources required to implement the structures indicates that there is an area penalty of approximately 18% when using the distributed bus register file. The timing analysis revealed that our proposed structure outperformed the bus based scheme, even for a relatively poor example. For programs with fork and join synchronization in the inner loop, the difference in the performance of the two schemes will be significantly larger.

References

[1] J.D. Allen, V. Garg and D.E. Schimmel, Analysis of control parallelism in SIMD instruction streams, Proc. Fifth Symp. on Parallel and Distributed Processing, Dallas, Texas (Dec. 1-4, 1993) (IEEE Computer Society Press) 383-390.

[2] M. Auguin, OPSILA computer, Proc. Int. Workshop Alg. and Arch. (North-Holland, 1986) 143-53.

[3] T. Blank, MasPar MP-1 architecture, Proc. COMPCON Spring 90 - 35th IEEE Comp. Soc. Int. Conf., San Francisco, CA, 20-24.

[4] T. Bridges, The GPA machine: A generally partitionable MSIMD architecture, Proc. Third Symp. on the Frontiers of Massively Parallel Computation, College Park, MD (Oct. 8-10, 1990) (IEEE Computer Society Press, 1990) 196-203.

[5] T. Bridges, S.W. Kitchel and R.M. Wehrmeister, A CPU utilization limit for massively parallel MIMD computers, Proc. Fourth Symp. on the Frontiers of Massively Parallel Computation, McLean, VA (Oct. 19-21, 1992) (IEEE Computer Society Press, 1992) 83-92.

[6] H.G. Dietz and W.E. Cohen, A massively parallel MIMD implemented by SIMD hardware, Technical Report No. TR-EE 92-4, School of Electrical Engineering, Purdue University, Jan. 1992.

[7] G.H. Golub and C.F. Van Loan, Matrix Computations, 2nd ed. (Johns Hopkins Univ. Press, MD, 1989).

[8] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach (Morgan Kaufmann, San Mateo, CA, 1990).

[9] K. Hwang and F.A. Briggs, Computer Architecture and Parallel Processing (McGraw-Hill, 1984).

[10] M. Johnson, Superscalar Microprocessor Design (Prentice-Hall, 1991).

[11] MasPar Computer Corporation, Sunnyvale, CA, MasPar Parallel Application Language (MPL): User's Manual (July 1993).

[12] MasPar Computer Corporation, Sunnyvale, CA, The design of the MasPar MP-2: A cost effective massively parallel computer.

[13] D.E. Schimmel, Superscalar SIMD architecture, Proc. Fourth Symp. on the Frontiers of Massively Parallel Computation, McLean, VA (Oct. 19-21, 1992) (IEEE Computer Society Press, 1992) 573-576.

[14] H.J. Siegel, T. Schwederski, J.T. Kuehn and N.J. Davis IV, An overview of the PASM parallel processing system, in: D.D. Gajski, V.M. Milutinovic, H.J. Siegel and B.P. Furht, eds., Computer Architecture (IEEE Press, Washington, DC, 1987) 387-407.

[15] C.C. Weems, E.M. Riseman and A.R. Hanson, Image understanding architecture: Exploiting potential parallelism in machine vision, Computer 25(2) (Feb. 1992) 65-68.

[16] N.H.E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed. (Addison-Wesley, 1992).

[17] P.A. Wilsey, D.A. Hensgen, N.B. Abu-Ghazaleh, C.E. Slusher and D.Y. Hollinden, The concurrent execution of non-communicating programs on SIMD processors, Proc. Fourth Symp. on the Frontiers of Massively Parallel Computation, McLean, VA (Oct. 19-21, 1992) (IEEE Computer Society Press, 1992) 29-36.

Vivek Garg received his MSEE from Georgia Institute of Technology in 1992, and his BSEE from the University of Delaware in 1990. He is currently a Ph.D. candidate in the School of Electrical and Computer Engineering at Georgia Institute of Technology. His research is focused on architectural innovation for high performance SIMD machines. Other areas of research interest include parallel algorithms and architectures, interconnection networks, and VLSI design. He has held a Graduate Fellowship and a Presidential Fellowship at Georgia Institute of Technology (1990-94). He is a student member of the IEEE Computer Society, ISHM, Eta Kappa Nu, and Tau Beta Pi.

David E. Schimmel was born in White Plains, NY, USA, in 1956. He received the B.S.E.E. with distinction and the Ph.D. degrees from Cornell University in 1984 and 1991 respectively. He has been a visiting engineer and a consultant to IBM Almaden Research Center. Since 1990, he has been an Assistant Professor in the School of Electrical and Computer Engineering at the Georgia Institute of Technology. During the Spring 1991 term, he was a visiting researcher at the University of Linköping, Sweden. He was also a Summer Faculty Fellow at NASA's Jet Propulsion Laboratory in 1995. His research interests include parallel computer architecture, algorithms and interconnection networks, asynchronous systems, VLSI design, and the impact of packaging technology on systems. He was the 1993 chair of the Atlanta, Georgia chapter of the IEEE Computer Society. Dr. Schimmel is a member of IEEE, ACM, Tau Beta Pi, and Eta Kappa Nu.