nasa contractor report 201599 icase report no. 96.54 ica · nasa contractor report 201599 icase...

NASA Contractor Report 201599

ICASE Report No. 96.54

ICA

BALANCING CONTENTION AND SYNCHRONIZATION

ON THE INTEL PARAGON

ShPhid H. Bokhari

David M. Nicoi_..,_ ,...._._; ,,--

NASA Contract No NAS/-19480

August 1996

institute for Computer Applications in Science arm Engit_eeringNASA LanglQ, Research Center

Hampton, VA 23681-0001

Operated by Universitie_ Space Research Association

National Aeronautics andSpace Administration

Langley Research CenterHwnpton, Virginia 23681-0001 19961021075

https://ntrs.nasa.gov/search.jsp?R=19970011922 2019-02-03T04:01:54+00:00Z

Balancing Contention and Synchronization

on. the Intel Paragon

$hahid H. Bokhari

Department of Flec_rical Erff]in_:ering

Unit,_ rsity of Engineering 1_; Technology

Lahore, Paki,qtan

David M. Nicol

Departraent o/ Compvter .qcieace

Dartmouth College

Hanover, New Harnpshir_

Abstract

The latelParagon is a mesh-connected distributedmemory parallelcom-

puter. It uses an obliviousand deterministicmessage routing algorithm: this

permits us to develop highlyoptimized scheddes for frequentlyneeded com-

m_: nication patterns.

The complete exchange is one such pattern. Several approaches are available

for carrying it out on the mesh. We study an algorithm developed by Scott.

Thi.s algorithm assumes r.hat a communication link can carry one message at a

time and that a node can only transmit one me_age at a time. It requires global

synchronization to enforce a schedule of transmissions. Unfortunately global

synchronization has substantial overhead o_ the Paragon. At the same time t he

powerful interconnection mechanism of this machine permi_ s 2 or 3 messages to

share a communication link with minor overhead. It can also overlap multiple

message travsmission from .'.he same node t,o some extent.

We develop a generalization of Scott's algorithm that execute_ complete

exchange wi_;h a prescribed _:ontentior, .qcheduh;s that i,cur greater conte,tion

require fewer synchronizatiolt steps. This permits us to tradeoff contention

against synchronization overhead.

We describe the performance of this algorithm and compare it with Scott's

original algorithm as well as wi_h a naive ;algorithm that does not take inter-connection structure into account.

The Bounded contention algorithm i:-.,always better than $co_t's algorithm

and outperform_ the naivv algorithm fi)r all but the .'+mallest message :;izes.

"l'he naiw; algorithm fails, to work on meshes larger than 12 _, 12. These results

show that. due consideration of processor interconnect and machine performance

parameters is necessary to obtain peak performance from the Paragon and itssuccessormesh machines.

Re*earth sur,ported by' NASA Contract NAS1-19.t86, while the o.utl,ots were in residence at

the Institute for Computer Applications in Science &; Engineering, NA,qA Langley Research Ce.nter.llarnpton. Virginia.

This research was p,'rform*.d using the Trex ,512 node Paragon operated by' Caltech on behalf ofth_ Concurrent Sup,,rcornputing t."_n_ortium Acc,.ss to this fatality w_ prcwid_d by' NA_;A

Shahnd tt Bokhari w_s additionall.v sul:,por;cd by a grant from the Directorate of IRcs,mr,'l,

Exr.cnsion and Advi-,ory Sprvices. University of Engineering .t: Technology, Lah,)re

1 Introduction

lnterprocessor communication ow;rhead is a major factor that limits the performance

of distributed memory parallel computer systems. All machines, no matter how pow-

erful their interprocessor communicat ion mechanism, suffer from thi,_ overhead. Com-

municaliol_ overhead is exacerbated by node and link conten,'.ion. Node contention

ari_:s when a node attempts to transmit or receive several messages simultaneously.

Lit& contention is caused by tile sharing of a communication link by two or more

messages. Contention arises in all but the simplest communication requirements. In

some cases, contention can be minimized or eliminated by careful scheduling of mes-

sages, tlowever this require.'; that all processors in the system synchronize themselves

at specific points ire time--thereby incurring synchronization overhtod.

The parallel algorithm designer is thus faced with the following dilemma:

• A completely contentiort-free schedule will incur substantial synchronizationoverhead.

• A completely synchronization-free schedule will result in he_',3 contentioxJ over-head.

Clearly there is a need to find a balance t:,etween the two types of overhead in order

to minimi_e the overall e×ecution time of the. parallel algorithm.

The complete exchange is an interpmcessor communication pattern that arises

in a number of important applications. It requires each processor to send a distinct

message to every oi her processor i n the .%vstem and is thus the hea_'iest corn munication

requirement that can be imposed on a p;_rallol computer. ('-mlplete exchange has; been

extensively studied and a number of algorithms are knowz_ for il_ efficient execut;ono_ various interconnection net w_rk_.

We de::cribe a _tudy of the complete exchange on mesh connected parallel ma-

chines. We start with an algorithm to execute the complete exchange on meshes

that was developed by I)avid Scott,. We develop a generalization of this algorithm

that permits us to decrease, synchronization overhead by increasing contention. We

dcsm.lb'"" ,'. our exT,erimeni. ; with this approach on the 512--node Intel Paragon mesh at

Calt,:ch. It is seen that the generalized algorithm cart be used to b_dance contention

al,d synchroniz_.tion overhead and thus obTMn sigl_ificant reduction in the time re-

quired 1o execute the co;nplete exchange. lhe generalized algorithm is also ,.;hown to

give better pedormance thorn a naive algorilhrn _hat does not take the interconnect

of the Paragon into, a,'count.

Our results dernonstrat,: that careful consideration (,f parallel ma¢hitte intercon-

nect and performance characteristics is needed in order to obtain the best perfor-

matice. A,; an extreme example, the naive algorithm (which doe', not tzke the it_ter-

connect into account} fails to execute on Paragon meshes of size iarger than 12 × 12.

because the operating system catanot allocate enough memory for ttte large amount

of commu:|ic_tion traffic required. For such meshes we have t_o the,ice bltt to t|_e an

Figur_1: Th_ mesh interconnect of a 4 x 4 Paragon. The circles represent compute nodeswhile the squares show special purpose hardwar_ ff_r communication. Message routing isdone via the "row column" algorithm explained in the text. "Ihe figure shows two pairs of

processors communicati_g and contending for a _.ingle edge. Such link contention c.an leadto substantial overhead.

algorithm that carefully ,;chedules communicatkms, such as Scott's algorithm or its

generalization (described in thi:, paper).

2 The Paragon Mesh

The mesh ha.,; long been a popular choice for interconnecting parailel r:ornputers.

Currently. the most powerful example of the mesh is the Irttel Paragon t. The spe-

cific xnachine on which the experimen'_ dc:_scribed in this paper were carried out is

]o_ated a*,. the (:enter for Advanced Compul,iitg Research at Caltech _. It i_ made

up of 512 compute nodes organized in a 1.6 x 32 array. Each m_dv is composed of

two Intei i_60, processors. Onc ser_es a.s a ¢ompuie processor and thc othcr as a

communication processor. In addition there i._ special hardware for interfaciug wit h

the intercommunication neLwork. The interprocessot c<Jrnmunlcation network is a

mesh with "ro_-.column'" routing (Figure 1). A messagv traveling from source s to

destination t firs_ travels along the tow in which s lies. un_.il it rea,:he._ tae coJ_lrlll!

in which t lies', ]t then travels alon.¢, the column t,o t. Two messages travelit,g si.

multano:m,.ly between two differen: source-de,4i_xatio1_ pair._ may need to traverse the

same commu1,ication link, as illu,_Irated iT_ lVigure 1, an,] will incur link coT_tenti{m

Ihttp://www.ssd.lnt_l.,zom/paragon.ht_l

_http://w_w.cacr.caXteeh.edu

Node ,:ontw.1. Link cont=:4

e---e-.-- l.---e

Node cone=:4. [.ink cont--8 Node cont=2, Link cont=b

3

Figure 2: Explanation of node and llnk contention on chains of processors. Node contentionequals tile number of messages that a processor attempts to _ransmit simultaneously. Link

contc, ntion is given by the maximum number of messages passing through an), communica-tion hnk in the chain.

overheM.

]:he routing mt_chanism on the Paragon is oblivious (the paths betwee,_ all source-

destina'tion pairs are statically defined) and dtter'm_nistic (a single mute exists be-

tween every ,,source-destination p_ir). As a result, it is po.s,,,ible to azcurately predict

the time required for a communication step, provided no contention is taking place.

A message passing through _. node en route to its destination does not impact

the computation occuring at thai; node as the routing is carried out by special hard-

ware. "]7he i860s run itt 50 3,'[_-_Zarid are cap,,ble of 75 MFlops. [[his machine has 32

Megabytes of memory per node of which about 2.1 .Megabytes are available for user

programs. Measured performance parameters of t,he Paragon are given in /able I.

The communication expression in this table is obtained by using the specific com-

munication scheme employed in subsequent experiments with the complete exchange

and thus differs from the expressions reported elsewhere [2.5].

Table 1: Performance Parameter,. for the Paragon

i Syrlchrorlization ,L x n proce,,,sors ( 271 Io82 n - 13.t #sacCommunication, message m >_._b40 bytes L 231 q. 0.022m _scc !

[:igure 2 clarifies the concept., of rind. at,(I link cot ltention, as applied to chains

of processors. The interl)retat, ion of the._e toncept:_ for meshe,, is very similar though

difficult to e>:plain in a ,,.imple diagram.

The successor machine to the Paragon i_ the Intcl ASCI (A, celer_.ced Strate-

g{c ('ornputing [nitia,.iv-) TeraflOl_'_ [12]. whh:h is currently being installed at Sandia

kab_,ratories "j. 'lhis mech!ne also ha-, i, mesh inferconnc'ct and the tecbni,lues c.l(,

_h_tp://www.tld.in_eX,com/t_l¢,p.ht_l

_http://,_.ct._andia.got/teraflop.htm_

D

P1 P2 P3

(=)

o]

P4

i,,_ll IG!i 'I .,.-===_ I L-,w=._ I

[,-t ..... t

;l J ;; IK_t [-L];

I i r_ [ _ _

IfN, lol

P1 P2 P3 P4

(hi (el

Figure 3: Co]np]et_ E×chanR,e on :I Procesmrs. Ii) change storage flora column order (a)to row ord( ;'(c), each processor must send a distinct message to every other proce_sor (b).

scribed in this papor ,_hould be applicable to the new machine as welt.

¢11

2he Complete Exchange

tan ,.istributed memory par_,llel computer, the complete ezchong_, require,; each of

N pvoo;ssors to send a distinct m byte block to each of the remaining :\: - 1 proces-

sor._. This communication pattern, which is also kJ,own as all-to-all per.sonali::ed, i_.

at the heart of many Lmportant raulticomputer algorithrn_ such as matrix transposi-

tion. matrix-vector m_.fll,iply, Fast Fourier "l'ransforms and the Alternati,g, Directions

Irnplicit (ADI) method for solving t,arTJsl differemial equations. To understand the

data movement required by this pattern refer to Figure 3 which shows a 4 _,:4 block

matrix stored on 1 processors. In part (a) of this Figure the matrix i,_ stored in cob

umn order. In part (c) the layoul h_s been change_d lo ro_ order. It is clear that to

change from (a) to (c), e,ach pro(essor must tra,nsmit a block of data to every other

processor. "[hi._ is shown in part (b) which is a complete directed graph of four nodes.

In gmwral, complete exchange on N procesmr.,: can be represented by a complete

directed graph of ,\: nodes. It i, thu_ th_ den,rest po.,sible communication require-

meni and the time required by a dist]'ib_h d rnemory multicomput, er to execute it

is an important performance parameter. At the same time. it is a challenge for fhe

algorithm de:4gner to develop good algorithm.,; for complete exchange on different

parallel architectures,

A number of algorithms have been developed for executing the complete exchange

,m hypcrc_d,,'s [1. ft. _. g. 11] and meshes !1, 9] l'hese algorithm_ ;_t.t.empt t.o (,I,t.;duhigh performance by carefillly .scheduling communications so as to avoid node and

lin_ conlenli<m. \_,e e_.n classify these algorithms into two categories. In Dtr'eetal_oritl,m.,: each block i_ lransmitte, l once to it_ uhirnate de._tination: in 5tor_-a,d-

forword algorithms a block is combined with others and transmitted in stages via.

intermediate processors. Store-and-forward algorithms [7] strive to reduce the impact

of startup time by incurring data permutation and extra transmission overhead. It

ha,_ Leer, shown that such algorithms perform _'ell fox small message sizes. Direct

algorithms [11. 9], on the other hand, have better performance for large message

1'he time required to execute the complete ex-change will depend on the intereon-

nection network and the schedule of data transfers. We shall address the problem

of developing good direct algorithms for mesh connected parallel architectures. The

sparsity of the mesh interconnect makes this a difficult endeavor. This is in contrast

with hypereubes, for which optimal direct algorithms (i.e.. those that. require N - 1

transmissions for an N processor system) have been known for some time.

4 Scott's Algorithm

The problem of implementing complete exchange on a me._h architecture has been

studied by Scott [9] under the following assumptions:

• A node can send and receive at most one message at _. time.

• A cornmunica'tion link can carry at most one message in each direction at. onetime.

• Messages are routed according to t ht_ "row-column _ algorithm, ghat is. a mes-

sage from processor :r_, y_ to processor .r_. y_ fiu,_t travels along a row to :r_. :u_

and tl',en along a column to :r_. y2.

Scott show,; that. under these assumptions, a square mesh of -Y node:, cannot a c.fieve

the complete exchange in fewer than N'_/_/I steps, unlike a hypercabe, which requires

N- 1 steps. The intuitive reason for this is the far richer iaterconnection of the

hypercube which comes, of course,, at the cost of a logarithmically increasing node

degree.

Scott goes on to describe a procedure that will generate, a schedule of transmissions

that takes exac.tly N3/2/4 steps, for the case where N i_ a mull.ipleof 4. This procedure

is based on composing or "cross- multiplying" pairs of 1-dirnensional permutatioDs and

can lead to many different t_et,; of schedules, depending on the choices made when

composing th,. po.rnmtati,_ns. I"igure 1 ,';l_ow; three permutations out of a set of 128

generated for an 8 × 8 mesh. The cells in this diagram are assumed to be numbered in

row-major order. A non-blank cell indicates the coordinates of the target to which the

corres.ponding processor has to transmit. A blank cell indicates that tl',e corresponding

processor doe.:, not transmit anything during that permutation. As we increase the

size cf the mesh. the proportion of these idle proce',sors increases be('aus,, the me'_h

intercc)nvect cannot support transmissions by all proce,;sors. It is these idle processorst a;_t lead to the !;uperlix:ear .Vs/;/4 express;ion £,r run time.

_j7,47,27._

2,6(2,0

4,6 i4,0

'.____i__ , 0,4!0,2)0,5

] 6,4j6__,&[6,5

!,31 LT_]

{3,713,1{5,7 5,1'

0,3 I____._.___.__

6,3 '

r

_2,1 2,7

17,1 7,7

J l

!Z_F-

,, { _ _--_4,3}4,5 4,2 4.4l

,__J.... i_,_0lz,;_1,311,5! 1,2 t,4{ I

13,_i3,s_'T_.4[---L_--j,--, {5,0 tsTj

L __2,3L-4_Z_s_k___.L_,o

r--1-<-_

--l-q_,__T2,6 2,L { 2,4 {

.... = _, , ,

6,4}6,6 6,1 ;

1{7'7 {7,2

--I 0,7 , 0,2

_,i; I_,4 ! -

Figure 4: Three o,,_ of a set of 128 permutations for ,an _. × 8 mesh. The cells in this

figure represent proressor,_ and a.re numbered in row-major order. At, etnpt.>' cell indicates

an inactive proce_,_or, A non-empty ceil gi'.es tee ¢oozdlnates of tile (eLl to which that cell

transmits,

5 Bounded Contention Algorithm

The permutations generai ed by Scott's procedure assume that only one message can

travel over a link in one direction at a time. As a result all nodes cannot, in general,

transmit during any given step. This is evident ill Figure 4. where we see that. half

the nodes are always inactive. If we h_ve a square mesh of n x n = N nodes, the

mimber of ._teps required is n3/,t =: N3/2/4 and during each step a fraction 4/n of the

nodes is inactive.

If we relax the constraint that a link onJy carry one message at a time, it becomes

intere.,ting to explore if schedules can be generated in which contention is bom_ded by

some integer c. The permutations shown in Figure 4 cannot simply be superimposed

because the active nodes in any pair of permutations are not disjoint.

Scotl's generation technique crea_.es permutations that can be executed in any

order to achievc the complete exchange. The ._et of permutations generated is not

unique. We have de,:doped an algorithm to generate a set of permutar, ions in a special

collapsible order. This generates permutations in such a way that consecutive entries

in the sequence ¢ai_. be collapsed to form a denser permutation /i.e., one in which

more node-,; are active), with gre_tter contention. The collapsibility property is not

true of Scott:s permutations in general.

Figure 5 shows two permutations for an 8 × g. mesh that can be collapsed to form

a third. Since each of the constituent permutatiov.s has link contention bounded by

1. the contention in the collapsed permutation i.q bounded by 2. It is also clear that

each node is transmitting exaclly once.

l%r the S × 8 mesh shown in Figure 5, the fraction of active nodes in the constituent

permutations is 4/n = 1/2. We can combine ,,ets; of two permutations each and thus

halve the number of steps required to achieve complete exchange.

We have developed a theory of collapsible schedules for the complete, exchange

on meshes. We can show that for a square mesh of n × n := :V nodes that permits

contention c on its links, the number of steps required is h3/4c, where c is an integer

< n/4 and c divides n/4 (i.e.. r,/4c is art integer).

We have implemented an algorithm based on this theory and used it to gem_raleand verify" ,_¢hedules for meshes of ._ize 4 × 4. $ ,_ 8, ..., 32 × 325. Table 2 show._ the

improvement possible a_ the permitted contention is allowed to increase. For each

mesh size, the' minimum steps possible are n 2 at c -- u/4. This is within I of the

theoretical minimum n 2 - 1. The blank entries bel_,,w the principal diagonal in "['able

2 are caused 16" the constraint that 1_/1c be an integer. Thi_ table assumes _hat no

node contention is permitted, i.e.. a node cannot transmit more thaw one me_sage ata time.

'lhe ,<hedule.,. generated by this algorithm have ! he inleresling pr_)pert,_ that theycan be ,.ollapsed to whatever degree is permitled by the rules stated above. Thus the

schedale for 16 x 16 meshes could be collap,,;ed for lit& contention 2 or 4 by covnlfini_lg

Lq.ehedule_ for m_,_h_s ofsize 'I × '1, g × g I l :_ ) , 12 and 113 * 1t3 are available at th_ follos_'ing sil_:

:p: l/f tip. tease, adu,/pub/c $/8 hlthl, d

{1,1_ 1,7

T'_I_'7-'3,3{ ' 5,3

_-------- 2,3

_o,_io.-f-_-3-

-_ ___

3,s ] s,2 i3,_.S,s I s,2 i 5,42.si 2,2i2,4

----F-

1,0 1,6

_,o17:6

L2,_.1_2,_2_6

L--'_'t_li,_' _,_-'Y_! 7, tt--_7,7 17,3 [7,5

i_-_ _-:__,____:±I_,_5,1 _5,7 5,3 _,5,5

2 1 2,7,

14,1 4,7

io,I 0,716,1 6,7

2,3_,s4,3 ]_,,5

0,3 I0,5

6,3 l 6,5

1,41,0 1,67,2 7,% 7,0 7,6

3,2 3,4 3,0 _',6

5,2 5,4 5,o 5,62,2 2,_ 2,o 2,64,2 4,4,4,o i _,,60,2 _,_ i o,o o,60,2 °,_i°_° o,6

Figure 5: The first two permutations; cat, I)v c¢;llap_ed to fi)rm the third. This is: possiblebecause th,? active cells in the, first perm_tation correspond exactly to the inactive cells in

the mcond and t'ic_ t'erm. Since the ]ink contention it_ the first two permutations is 1. the

rombined t,ermutadon ha_ l_r,k ('otl_.entiozt 2.

Table 2: Steps req,ir¢,d as contention is Mlowed to increase.

[;, l-'ermitted l,;nk Contention (c) ]

(0 × .) [__---_,---]=g----[4 15 16 LL_2_._____jMesh size

4 *: 4 16 { '

s × s 12s 64 ! }I

12 × 12 432 144 {

16 x 16 A 1024 512 256

207__1 -¢dN 1ooo ' 400

s2×s2l! s,o2{4o9 i{ 12o,,s

576

784

{1024

consecutiw, sub-sequences of 2 or 4 permutations as shown in Figure 6. If the first

synchronizatkm in part (c) of this figure were removed we would have a _chedule with

node as well as link contention. Two nodes would be attempting to transmit at a

time while the link contentien would be doubled from 4 to 8. This can lead to further

improvements in run time. as described below.

6 hnplementation Considerations

The nx message passing library was used for our experiments or,the Paragon. This li-

brary has its origins in the Intel iPS('--860 hypercube which has two types of messages:

FORCED arm UNFORCED.FORCED messages are transmitted from source to destination

under the assumption that a receive has already been posted (i.e.. buffet" space for

reception has been specified) at. the destination. If an arriving message does not. fi'.d

a receive posted, it isdiscarded. UNFORCED messages do not require a receive to be

posted beforehand. Before an UNFORCED message is l,ransmitted there i.', an exchange

of control messages between _urc¢. and dest inatic,n to allocate %,erating system buffer

space for the message. This leads to addilional overh.,.ad in communicatk)n (because

of the cont.rol mes,:ages)., extra memory requiroments, and the penalty of copying

from operating system buffers to user _xreas [3]. Further details of the communica-

tion overhead on tbc. Paragon appear in [2]. Shirley et al. [10] discuss how operating

system limer in,,-rrupts complicate performance measurement and prediction on this,nachi,,_-.

On the Paragon, FORCEDand UNFORCEDmessages are supposed to perform iden-

tically. It b,a_, been our experience that operating system space is allocated for all

porsible arriving me.s_ages in add,tion to any user mentors locations that may l,e

set a:,ide by explici_.ly posted rcceive_. "7}_e user can ,_pecif.v the amount of memory

buffers that t.l,e ,Jpe.ra_ in!,_ system is to set aside for this purpose. Despite this, when

lar_, number_ of htr_e-sized tnes.sages are expected, the, op_,ating s._ aem _.'a,, r,,n

lO

l===,II.=.,l

II II_Syne.

I'I',I.N.'I" I

II II_,--"S:)_Ic.

Illl'.",.'l I,

I:II:lII "-I

iiIIilII I

|I II

I II I_Sync.

:0 ':":IlJS

:,:, ,:,:I'11"1

_S,_c,Ill u

IiI I iMi II

..I II l.,I.I .|I

I II Iill I Iii I

1I .=:_Sync.

(c)

Figure 6: :a) The first 8 {of 1024) members of a <'ollapsible schedule for a 15 ×lfi mesh.Active nodes are indicated by square blocks. When :alternate synchronizations are removed(X). pairs of successive permutatior, s col]ap_e as ,hc)wn ill part (b) giving a schedule, withmaximum link contention 2. Repeating this pror,,_ ,'e_Jlt.,_iu a schedule with link con,.nn,lont {¢). lrurther removal of _yachroniTation slep_ res.:!t- in incrtasing node contention.

11

out of resources thereby causing the machine to hang. Needless to _ay, FORCED mes-

sages should only be used if communication requirements are well understood and

receives can be posted before an:, messages are launched. Deadlocks can develop if

this requirement is not satisfied.

The Bounded contention complete exchange algorithm that we have developed has

a completely determined communication requirement and we could thus use FORCED

messages. To compare the performance of the Bounded contention algorithm again,st

an algorithm that does not take the topology of the mesh into account, we imple-

mented a naive algorit,hm to carry out the complete exchange. "Ibis algorithm simply

tran'mdts block, of data from each processor to the remaining processors without

regard for link or node contention. We were unable to get the naive algorithm to

function reliably beyond 12 >,.12 precessor._ because "the large nurr, bers of outstanding

receive,; required couM not be accommodated by the operating =,ystem.

Each node of the Paragon has an i860 processor dedicated to ih,erprocessor com-

munication. This processor takes over a considerable portion of the overhead of

starting a data transfer. _,_ haw: found that a._.ynchronous receives and sends vMd

much better performance because the compute processor can spawn a task on the

corr'.,munication processor and carry on with its work without having to wait for the

_,peration to complete, q'his, in fact. is how lhe machine manages 1o perform wellunder node contention.

Memory access and t hu:; data communication on the' Paragon is heavily affected

by the starting addre._s of a tran_ffer. In our experiments we have aligned all arrays

to, 4k boundaries fthe page size of the rnachinel to reinimize t,his impact.

7 Experimental Results

When implemer,ting Bounded contention complete exchange on the Paragon..,everala-:peers of the machine performance had to be taken into account.

1. The amount of contention in a schedule can only be controlled by global syn-

chronization. _Ihe o', _'rhead of this operation is substantial (Table 1).

2. While the machine can tolera_,e node and link contention, there is non-zero

overhead associated with such contention.

Overheads for nod,', and link contention are heavily dependent on the ty'pe of

communicat ion bet ng car]'ie,:l oul. It is very difficult to obtain simple exprt_;sions

h_r the._e overheads. I'or example, measnrernents taken of the 1-dimen:;ional

communication patterns in Figure 2 do not apply to 2-dimensional communi-cations.

The above a_pt'cl_, templed wilb the use of virtual memory on the machine and

t[w _omplex effe.cl: _,f operating ,:.vsT_-m interr_lpt_" '10] make it extr,_mely dif_ic_dt Io

predict the communicatmn performance of this machine under varying amounts of

12

Figure7: Yniw algorithm("+") compared_ilh Boundedcontentionalgorithmona4_ 1Paragon.Thenaivealgorithmrun times,whichdonot vary_'i)}_conte,tion, ha','e beenshown as a _eries of s_rips for clarity,

node and liItk ('ontention. This in turn also make_ decision of the lev(:l of contention

to t,e used difficult.

Our approach is to evaluate the algorithm k)r various levels of permitted con-

tenLion and empirically decide on the best level for a given mesh size. Thi_ is e_sily

done (),n(:e a collal_sible sequent(, has been generated for a mesh: _irnply insert bar-

rier synchronizatio,_ in the sequence, modulo the permitted contention. "I'hu'.;. for a:$2 __ .32 mesh we would in!_ert barriers after every I. 2, 4 or !._permutations. For ex-

ample, inserting barrier,_ after every 4 peruv,>.t al ions causes each group of 4 to collapse

into one permutation with contention 4.

Figures 7, 8, .q arid 10 compare the l.erformance of the naive and Bounded ¢on-

tentiou a]gorithm_ on meshe_, of ,dze ,1 > 1, _ :,_8, 12 × 12 and iG × IG respectively.

for varying amounts of contention and rues.sage sizes. The.. z-axes of these plots are

labeled with the pair:_ (node contention, link contention't, as clarified in Figure 2. The

performance of the rtaive algorithm, which does not vary with r(mt.eration, is shown

_s _ series o,[ strip._ so that the surface of the Bout_ded algorithm can be seen clearly.

The small _dz(, of the 4 >,:4 mc.,J_ rl(_,s ,or [.ennit a collapsible scheth_te t<) he

generated (see, ]'able :2). Despite thi:,, there is _tu hnprt_vement in performance ascontention i,c.rea.,es, because the number of >ynehronization steps required i,; reduc(,d.

Furthermore. nod," contention also res,h,, it) slight decreases in :ime a2.:launching )wo

or more ine,._ages in quick succession permits the utilization of ;nit'anode parallelism

cbJ¢-' ', a :_el',arate colnmutJication processor.

_,urc_s ,_ and 9 _how much more interesting results obtained from experiment.-,,n _ " _ arid 1'2 _ 1'2 me._hes. IIero. the perforlnal_ce of the Boundod algorithm

13

Figure 6: (-:ornpari_on of tho _wt_ algorithms on an __× 8 Paragon.

Figure 9 ('orltpari,.r,t_ r_f the two all_orilhrns c)n a ]2 _ 12 Para_or.,

14

16,64 _ SO(X; .... "-_OOe Itrtk oo_eQ_ion 256 1 _24 v

Figure i0: Performance of Bounded contention algorithtn on a 16 × 16 Paragou. The

a_ix'e algorithm fails to work on this mesh and the Bounded algorithm fails at con:.ention256.1024) because of operating system limitations,

]s i,titially much poorer thai: the naive algorithm bur. improves ve, y rapidly with

increasing contentio,. The. initial steep drop i.,, due Io the col*apsir_g of the schedule,

(v,-hich incrcasos llnk contention but not node contention) an,_ ro the large reduction

in synchroniza_ iott steps. As contention it_(:rea,_eq, further _mprovements are obtainedbecau,._e of reduction in ,_ynchroniza_,iol: and because of the concurrent operation of the

cort_munic_:lio_, processor. However the improvement is arrested at node contention

= 16 whan the decrease in syn(:hronization sleps: can no longer offset the overhea<t

due to node and link contention. After_ this point the time starts increasing.

The performance of the Bou_ded algorithm for 16 x 16 meshes is shown in Figure

10. The Paragon failed to execute the naive algorithm for this mesh size. 'i'his is

be(au,;e the operating system could not _,llocate enough resources to accommodate

the 25fi rec(,ivcs requir,.d by the algerithm. The Bounded algorithm itself could notbe be tested for thi_ mesh _ize for node contention ----256 for the same reason.

The relative performance of the two algorithnts is clear irt l"igure 11 which shows

contours that indicate the percentage improvement of Bounded over naive. "lhese

,:ontours show that improvement s of greater than ',.)Se/_.a r_ possible on 8 × 8 and t2 × 12

nte,',h_'s for too,;* rnessage sizes, l)ro_'ided th, c()ldPr_tior-t level is chosen ci_refully. The

contours help us pick the ;.)est c,ot_t.ention level for a given message size.

q'o study our _'xl,erimcnt_d results in greater detail we pzovide ._lices. at mes:_age

_izc 1523'2 hyl._, through the surfaces of Figures 7. $. 9 _: 10.

The soli<t curves in the'.)e figures _h,,_': I,),_ _ea._ured time to execule Bou,d-d cor,-

_et_tion c,m_l)lete exchange. Thi_ mea_ur(,,l _i_ne i_ coml_ared with tl_e pr,_licted time.

obtait_ed by addin_ :_ynchronizatio_l anti comn_unication time: t;_k_', fror_a "!able 1.

15

l

• ".,. i

• . _%

3C' O0

25OOO

2OOO0

" "- - -.. rr_ss,_ge le_glh (byles)............. 15000

/

• 10%1OOOO

. ,- 15%

'" i" _-- ".... 20% 5000

t 2,2 4.4 $.e 16.16

r_de lirtk coq_enl_on

15 20 25%

(;'. . !,

.,-

i" ":-, .....', , , ,

• . ...

8 x 8 m_$h

11_ 1.2 2.4 4.8 B.16 18.32 32.64 64,128

3O0O0

250O0

2O00O

r_essa_e ,length,(Dyles)15000

aOCO0

5O00

noc_ link cor_tntio_

20% 2fl%

....' ' ,

: :i_ . : :,

___ _.._.___._...___.._..,.__ - - _ .

1 1 113 2.8 4,12 8,24 16.41_ 48,144 14.4432

30%

. ..--

3OOOO

2_00

20000

mess_ J_glr_ I_e es)1500O

IOOOC

5o00

_orle. ll_k COr_OI"_IOf'I

Figl]r_ l} ('onlm_r pl¢;;,_ of per(:Pntagr improvement provided b)" klo_nded contention

al_ori_b_, _,ver ,aive Mgori_hm for , × .I, a x 8 and 12 × 1V Para_arm _neshes.

II , m i I

16

!

00_

0.015

0_1

4x4 mesh Px_Je,aO4,adze 15232t>ytes

kQiIS ¢ r"toted

/

1,1 2.2 4,4 8i8 16,16

Figure 12: A slice thro':gh ll:e surface for a 4 _ 4 Paragon.

Figure 12 shows a slice through the surface for a 4 × 4 mesh (Figure 7). Three

as[.ects of this fignre _tre notewor_h.v.

• The agreement betweez_ predicted and meas_red titans i:_ good.

• "I he comrr|unicat.ion time fractio_ of the total predicted time is ('ol,stant. This

is beca, se in a 4 × 4 mesh schedub" there are no idl,e processors. Thus. even

when we increase permitted coI|tention, the schedule cannot collapse because of

the lack of _'t,oles" in the permutations.

• The iIlcrea._e in per_orm_mce comes about because of reduction in syr_chroniza-

lion overhead.

The sliee of the 8 × 8 surface (Figure 13) bring,, out several interestizig iss,es.

1o circumvent the difficul.ty of prer.lic*in_, l),-rformance we have insertod upper andlower bounds for time to e×ecu_e complete exchange in this _.nd subsequent figures.

I'he lov,'er bound ._ive_ the ,;urn of comrnunicati(_n _ad synchronization times _s givenir_. T_ble 1. Note that the ('omrnunication time is ha!red going from link cot|t_,nrion

1 to 2. This is because, a,_ sho_'n in Table 2, the number of communication steps

drops from 128 to 64 for atx 8 x ,q mesh. Since the lower b_,und does not include theoverheads of node and link contention, the mea_ur,_d time shonld not drop below this

CllI'V_.

O._S T

"?_

0 ° 4 " '_'

.",.

0.12 -",\

O. 1 "i",

0.08 . "_

"'. i0.06 ", .

0.04

_.02

0

1.1

mp_urecl

"'.---_ ................................................ /-"......':: ::...;_.....v.. :. ......

' ¢,_nn_mcglj_n

1 2 2 a 4.8 8.16 _6 32 32.64 64, 12gr'_,_e, li_k c:Ol'ltorllfo_

Figure 13: A ,;]ice through _he s_rf_ce for an 8 x _ Paragon.

17

The Bounded contention algorithm increases permitted contention by' deleting

barriers. This removes control over the launching of message,s: a processor c,-n fire

off" the next message in its schedule without waiting for synchronization. Some mes-

sage:, may be launched along paths already irn u.,_e, thereby increasiug contentioii.

"Ihe impact of this contention is very difficult to estimate because the communica-

tion pa_ terns of the Bounded algorithm are complex and their contention cannot becharacterized simply.

The upper bound curve give:_ the sum of synchronization and communication

time_._, assuming that all 128 message ,:teps are executed serially. We would expect

t}Je rrwasured time,, to li_, betweeji the tw¢_ bounds. The c!oser the zncasured time is

to the lower bound, the greater is the success of the Bounded approach. On the other

hand, the mca.sured curve would approach the. upper bound when the contention

overheads exceed the reduction in. communication and synchronization time.

In Figure 13 we see. that the measured time is close to the lower bound for link

contention 1. 2 _', 4. Beyond ,1 the measured t.ilne starts dev]atiHg significardly,

reaching a minimum at li_k contention 16. Shnilar comments apply to the slices for

12 _ 12 and 15 x lf, meshes (Figures 14 & 15). In the latter it is noteworthy that the

measured time almost touches (but does no_ cross) the upper bound at contewion

(128.51'2). (Recall that. thi_ eXl)erimem could not be run for the lt_t contention value

of (2:)6.102,11 becau._e of ow, rating _ystem lirnitaticm_.) This sl,_w,; that o;ir algozithrn

i¢ rol_tJSt i]_ _,he sen:;e thai the measured time remains })o_rMed by the time to execute

the individual communication steps.

iiii i i

18

07

I_b,12 mt_l_, m_e slz+i 15232 b,/l_

05

0.4

0.3

0.2

Ioi

1.1

%

\..\

+,.

+'." wpl_, I_Und

..,\ ..... _ ................

"-,_ romza_o_'"- + .... _*er bound

"..-.;. ............................... :'.'.. -.:::..; .._ ._:._.:+,+_.. ...................

-p

i

1,3 2,6 4 12 8.2,4 16,48 411,_44 144,43_I_o_l. kflk (JDnllr'qlon

Fi_uro 14: ._ slice thro,gh the surface for a 12 _, 12 Paragon.

16xl II _tr, m)ua_e 8i7e 1_,.'_12 byt.

f------- , , '5"'

16!\----v'-'_14. .

O.g \.

'\

0_ -

04

O2 I

I

12 14 28

" • . Inv_ou_l".,.,+

•_,',. ....................... . +' +...::- ...;;...,_,._,,.._,_o ...............

4.+6 1.32 I(I(14 3i.+i+ 64.;';,6 '_211J12 2SILI024

Figure 15: A +;lJ+:'_,through _he suffo.ce for a 1¢i ,: 1_3Paragon.

19

Fig_lres 13. 14 and 15 show that. a careful choice. .o,. ention levels is necessary toobtain the best performance. It. is not enough t.o blindly remove all synchronization

steps.

8 Conclusions

Complete exchange is an important comm_mieation requirement that is difficult to

execute efficiently on meshes. We have developed a new Bounded contention algo-rithrrt that. takes advantage of the high performance communication mechanism on

the Paragon to achieve good timings. 'fhe performance of this algorithm ha,s been

mea,'_ured to be better than that of a naive algorithm that does not take network

topology into accotmt. Our experience appears to contradict the commonly held be-

lief that topology does not have to be considered when designing parallel algorithmsfor modern parallel computer systems.

Our results are applicable to all meshes in which, like the. Paragon, the rate at

which data ce.n be transmitted acro,,_ the interconnect is higher than the rat, e at which

data can be injected into the interconnect. The successor to _.he Intel Paragon is the

ASCI "Ieraflop machine with a dual mesh interconnect [12]. '[his machine (:an take

advantage of our results in an interesting fashion. Our algorithm essentially "slices _

the complete exchange communication pattern into a ,serie_ of sub-patterns, each

wit,h a bounded contention. These sub-patterns can be alternately assigned to thetx_o meshes permitting us to take full advantage of the A SC !',: _owerfu] iW,erconnect.

These results are also applicable to 3d meshes because Scott ,. basic algorithm canbe extended to higher dimensions.

An interes;ting area of further research would be t.o combine the Bounded algorithm

which i,_ optimal for large message sizes., with the multiphase algorithm [4] which has

been shown to [',e applicable t,:) the Paragon [,5], and gives the best performance forsmall mes,;age sizes.

Perhaps the most crucial ,:onclusion to be drawn from our experiments is the

importance of synchronization time in determi_fing the overall execution time of a

communication step. Our results indicate that inve._tment in an improved synchro-

nizatior_ mechanism, perhaps relying on a network distinct from the network used for

dar, a comnl||r|icatior|, wollld yield hand.,_omc divide:M,; in terms of imptoved commu-nical iox_perforn ,ante.

Acknowledgments

We are grateful to Manuel Sala:. Director ICASE. for his ertc¢|ure_gement of this research.

Discus_ion_ with 'l'om Crockett. Pavrl Fischer. David Keyes. Piyush Mehrotra, J.hn %'ar_R(_'ndale. St,'v_ Sf,idel. and Xia'a-lle 5_1n have been very valuable.

m i

20

Access to the supercomputer_ at Caltech was arranged by Paul Messina. We wish to

thank him and his able staff: Walker Aumann, Bevan Bennett, Sharon Burnett. Matthew

Carle, Clark Chang. Shay China, Alex Leung. Jan Lindheim, Heidi Lorenz-Wirzba, Julie

Murphy, Mark L. Neidengard, Gary Dell'Osso, Andrew Sun, and Elsa Villate, for their

alacrity in answering our queries and for their patiet_t tolerance of the numerous crashes we

caused on their machines. Thanh Phung, ,M Bessey and Ellen Delegane_ of lntel helped us

with various problems on the Paragon.

References

I1] S. ti. Bokhari and S. Berryman. Complete exchange on a circuit switched mesh. In

Prec. Scalable. High Performance Computing Conf., page_ 300--306.3.992,

[2] Shahid H. Bokhari. Communication overhead on the Intel Paragon, IBM SF2 and

Meiko CS-2. ICASE Interim Report 28, NASA Contractor report 198211. September

1995. ht'cp://tmw, icase, edu/docs/hilites / index, interim, html.

[3] $hahid I[. Bokhari. Communication overheads on the Intel iPSC-_60 hypercube.

ICASE Interim Report 10, M_y 1990.

[.t] Shahid H. BokharL Multipha--.,e complete exchange: A theoretical analysis. IEEE

lhTnsactions or_ Cbmputer$, 45(2):220-229, February 1996.

[5] '3hahid H. Bokhari..Multiphase complete e.,'change on Paragon, SP2 and CS-2. IEEE

Parallel and Distr," uted Technoloyy, 3(4):45 .59, Fall 1996.

[6] C-I'. He aim M. q.'. Raghunath. Efficient communication primitives on hypercub_. In

Prec. 6th. Con]: Distriboted _$*morg Conc_Jrre_t _bmlattter,*. pages 390 397, 1991.

[7] S. Lennart Johnston and Chlng-Tien He. Optimum broadcasling and personalized

communication ]_ hypercubes. IEEE 7r_n,_. C'omputer_, ('-38(9):1749-1261q, Septem-ber 1989,

t._] T. Schmiermund and S. R Seidel. A communication model for the latel iPSC/2.

Technkcal R,:port (:S-TIt 900_. Dept. of Cornpttter Sdence, Michigau Tech. Univ..

April 1990.

(9] D. S. Scott. Efficient all-to-all t o,.nm_nication patterns in hypcrcube and mesh topolo.

gie:_. In Frae. 6th. Conf, Distrib_sted ,,llemory Concurrz_t (bmputers, pages 39_1-403,19_11.

[10] Hazel Shirley. Robert Reyrtolds. aml Steve R.. Seidel. Communication on the lntel

Paragon. l'_chnical Rop,rt CS-'1"1695-07, Dept. of (.'omp,ter Science, Michigan Tech.

t'ni_'., July 17. 1995.

[11] R. "lake. A rontillg me*hod for the _ll.to-all bur_t on hyperrulw uetwork. In /'roe.

•3Mh. No_ionM Conf. Into. Prec, .qoc. dcipan, pagf'_ 151 -152.1.9._7. I_1 Japanese.

i12] Tom Thompson. 'The world',_ fa:,te_t computer (for now). 13Vt¢, 21(1}:62..lanu_ry,1996.

.., I

F_ Appr_dREPORT DOL_ .iENTATION PAGE o_,_o oz_o:es

i| II

_llhe rql_llll_ s I_rlk,_ |e • I_ s C(_le¢l,on liq ,rr(_ll_er el i_lm41¢tci Io Ivenllie : h_r Mr rmpo_,4e. ;rt¢h_d_| the ll_r_l 1or r¢lf,¢w_l_|,nf trdctNO_t, telrch;n|ex,ll,n s dSl_ I_ jutes

O, UIWS fS'J|i_m) r. _U,te %_k_4, Athnrr_n Va _)_'_.2-430_ ). lif_ll _o tlHI 0J_1¢¢ 0t M|_llr¢mcql 8f_l _lJ_S¢l va_rtt4_ll _qll4_UCll@fl Yr_ ll;I_4._)1 |_J YYlil_lflfIgfl. IJA. 6'_D(._

1. AGENCY USE ONLV(L*J_/_nk) 2. REPORT DATE S. REt'ORT TYPE AND DATES COVERED

A_l_ust 199f Contractor _e_ort

4_ TITLE AND SUBTITLE S. FUNDING NUMBERS

BALA 5¢CI._G _,')N3 E'¢TIO._"._,ND .qYNCH RONIZATIO]X"ON THEINTEL PARAGON

I

i6. AOTHon(s)Shahid E. Bokhari

David M..Xicoi

7. PERFORMING ORGANIZ;_ION NAME(S) AND ADDRESS(ES) ""

institute for Computer Apphcat_ons in .Science and Engineering

Mail Stop 132C, NASA Langley Re*_arch Center

Hampton. VA 236_1-0001

9. SP'ONSORING/MONITOli_G AGENCY NAME(S) ANO AODRESS(ES)

National Aer_na,.tics and Space Administration

Langley Research Center

Hampton, VA 23851-000]

II

It. SUPPLEMENTARY N()TES '"'

L_mgley Technical Monitor: Dennis M. Bu._hnellFind ReportS,]hmitted to IEEE Parallel _ Distributed Technology.

]1 II J II _

I2a. DISTRIBUTION]AVAILABILITY S_ATEMENT

C IqASI-] 9480

WI." 505-90-52-01

S. PERFORMING ORGANIZATION

REPORT NUMBER

[CASE Repo.,t -_o. 9_54

10. SPONSORING/MOM_ORINGAGENCY REPORT NUMBER

1_"ASA CR-20159'_

ICA_E t_eport No. 96-54

,m • i

12b. DISTRIBUTION CODE

1.;ncl.assified- L'r:lim it ed

Suh_ct Category 60.61

i ] ii ii I

13. ABSTRACT (A'_a_-_um 2C9 we,as,_

The late! Paragc,n is a mesh.connected di_t.zibttted memory parallel computer. It uses an ob_ivioas and determinist.to

';,cssage toa*.ing algcrithm: th:, permits us to develop highly optimized schedules for freq,ently needed communi-cation patterns. Th," complete .-xchange is one such pattern. Severa_ approach,._ ar_ av.xilahle for carrying it out on

_he Iru.'sb. VCe st.ady aa algorithm develop(_l b.v Scott. Th= algorithm assumes that t commnnication link can carry

one mess.age at a rime and 1hat a node can only traumit one :nes:_sge at a time, l_ require._ global synchroniza-

tion r.o enforce a schedule of transmis.qions. (: nfortunately global synchronization has substantial overhead on the

Paragou. A_ the same time the powerful interconnection mechanism of this machine permit_ 2 or 3 InessaBes to share

a comm,nical.ion lir, k with minor overhead. It can al_ o_'erlap multiple message tramsmix, ion f_om the _tmt' node

_o some (.xtent. We develop a generali_a_ion of See, st's. algorithm that executes complete excha,ge with a l,re_:ribed

contention. '_'hedule, that incur gzeater contention require fewer sync_onization steps This permits us to tradeoff

¢ontemion agair,,_ _ynchtonization overhead. We de_cribe the performance of this algorithm and compare it with

Scott's o ngi,al algorithm t_, we_ t_ with a naive algorithm that doe_ not take intetconnection st tucture int_) arcc, unt.

The Bounded contenrir_n algorithm is always better than Scott's algorith,n and outperform_ the nsi.'e: aJgorithm Jot

all b,t the small,at me,,_age sizes. The naive dw,rithm fails to work on me_he_ larger than 12 × 12. "l'hese re.,ults

show that d_l¢ consideration of processor intercom, nets and machine performance parameters it necessary to ()bta_/n

peak performance from the, Pal agon and its successor mesh machines.

± lJ I I I I pl I

14 SUBJECT TERMS

all-t¢_aJJ pe,li(,n&h_c:d; complete exchanl_e: lmel Paragon: ¢ olnm,nir_tion

hnk contention: pazaY,,I <,mp,_ing. tvnchronzzation', mesh

17. SECURITY CLASSIFICATIONOF RI'PONT

I.;nclassifi_l

sN is'd o=-_,os:oe

III I

Ill SECURITY CLJJ_SIFICATION 19. SECURITY CLASSIFICATIONOF THIS PAGE or ABSTRACT

U ._:las_i fied,._, ,,

II

I$. NUU:=ER OF PAGES

lO, PRICE CODE

'20. LIMITATION

OF ABSTRACT

Stsnd_d Feem _H(Rov. |-BY)P,zocr,l_e_11_yliF:_;.Jbad Z_q 1¢

nasa contractor report 201599 icase report no. 96.54 ica · nasa contractor report 201599 icase...

Documents