


Computer Physics Communications 76 (1993) 301-317. North-Holland

Multi-million particle molecular dynamics
III. Design considerations for data-parallel processing

D.C. Rapaport
Physics Department, Bar-Ilan University, Ramat-Gan 52900, Israel

and

Supercomputer Computations Research Institute, Florida State University, Tallahassee, FL 32306, USA

Received 10 April 1992; in revised form 1 July 1992

This paper describes an implementation of a parallel molecular dynamics algorithm on the CM2 Connection Machine that is designed for large-scale simulations. The method employs a cell subdivision of the simulation region, and is partly based on the layer approach developed for vector processing. All communication is between adjacent processing elements, eliminating the need for global communication. Performance measurements were made with systems containing over 10^6 particles.

1. Large-scale molecular dynamics simulation

1.1. Introduction

In this, the third in a series of papers dealing with algorithms for large-scale molecular dynamics (MD) simulation that are tailored to modern computer architectures, we address the issue of data parallelism. The first paper [1] dealt with the special requirements of vector processing, the most widely used means for computer hardware to achieve supercomputer status. The second [2] discussed the use of distributed processing in a message-passing environment; machines of this class are traditionally known as MIMD (denoting multiple instruction multiple data), and the parallelism is manifested at the execution level of the program.

The present paper treats an alternative form of parallelism that is expressed at the data level, generally a more fine-grained parallelism than MIMD; such machines are referred to as SIMD (single instruction multiple data). While the evolutionary path for high-performance computing appears likely to follow the MIMD route, machines with SIMD architecture are widely used, and implementing computations on machines of this type not only presents a challenge which must be addressed, but often supplies new perspectives on familiar problems.

The remainder of this section provides a general discussion of the themes developed at length in the paper, namely MD and the need for computations of ever-increasing size, the architecture of supercomputers, and the intersection of these topics. Parallel computing is discussed in section 2, with emphasis on those features of the Thinking Machines Corp. CM2 Connection Machine which are important for the MD implementation, and section 3 summarizes previous related work using this machine.

Correspondence to: D.C. Rapaport, Physics Department, Bar-Ilan University, Ramat-Gan 52900, Israel (permanent address).

0010-4655/93/$06.00 © 1993 Elsevier Science Publishers B.V. All rights reserved


The new method is described in section 4; the key features are covered in sufficient detail that MD practitioners familiar with CM2 basics should be able to reconstruct the program with little difficulty. Performance is discussed in section 5 and compared with earlier work. The paper concludes with remarks on parallel processing and the preferred machines for large-scale MD simulation.

1.2. The goals of large-scale MD

MD simulation has proved to be an extremely useful approach to studying matter in all its forms [3-5]. Each simulation amounts to following the trajectories of the particles in the system and measuring various physically meaningful quantities; in essence a numerical experiment is performed. The inputs to computations of this kind are the specifications of the structure of the individual particles (atoms or molecules), their interactions, and the initial and boundary conditions of the system.

The amount of effort that must be invested in an MD calculation depends on the type of phenomena being studied. A great deal can often be learned from relatively small simulations; studies of simple fluids in equilibrium are an example. However, there do exist classes of problems that, to be studied in a meaningful way, require simulations involving substantially larger systems than those generally used; examples include studies of incommensurate surface phases [6], the nature of spatially organized flow patterns [7], and finite-size effects in homogeneous nucleation [8].

Just as the definition of a supercomputer changes with time, what constitutes a large-scale MD study also varies. Present day studies of phenomena that can be understood in terms of simple molecular models but which require systems of large spatial extent can involve between several hundred thousand and a million particles. At the other extreme, small systems may consist of under a thousand particles. While large simulations are capable of providing information that is quantitatively more precise than their smaller counterparts, the point that should be kept in mind is that the information can at times be qualitatively different, in that entirely new effects appear once the system becomes sufficiently large.

1.3. Supercomputer architecture

Many architectural enhancements have been introduced over the years to coax higher performance out of computer hardware that is fundamentally clock-rate limited. Many of these developments are hidden from the software developer, but a number of the key hardware features are of a kind that careful programming strategies can use to advantage. Some of these features are recognized by the compiler, with varying levels of success, during code optimization, but it is often impossible to fully automate the transformations needed to convert an algorithm into a form matching the hardware capabilities.

Of the mechanisms used to improve performance, vector processing is the most familiar, principally because it appears in essentially all production supercomputers. Distributed processing, both in SIMD and MIMD variants, is becoming increasingly popular, while the superscalar approach, which combines features such as very short vector pipelines and simultaneous multiple instruction execution, is prominent among more recent workstations. Single- and multi-level caches are also playing an increasingly important role in ensuring prompt delivery of data to the processing unit. It seems certain that future computers will rely on most, if not all, of these (and other) features.

The user with a computationally intensive problem will often need to extract the maximum performance that a given machine is capable of delivering, or reasonably close to it, which implies organizing the computations to satisfy the hardware preferences. For some combinations of processor and algorithm this is not difficult, for others the task is more daunting.


1.4. Supercomputer-based approaches to MD

One particular instance of a computation that is not readily adapted to computers with complex architecture is the MD simulation of systems in which the interaction between particles is both simple in form, such as the Lennard-Jones potential, and short-ranged. For these problems, a substantial fraction of the work is spent determining which very small subsets of particles are within interaction range of every other particle, a task carried out relatively inefficiently by a vector processor. Furthermore, when such an MD computation is implemented on a distributed-memory multi-processor system, although information is localized, in the sense that details of individual particles need only be known to other particles in the immediate vicinity, ensuring that the data is resident in the correct processor can lead to significant communication overheads. Yet another example of the way in which this type of MD calculation frustrates the assumptions underlying modern processor design is the way in which the data associated with groups of nearby particles is scattered throughout memory, resulting in random memory access patterns that can successfully defeat the best cache management strategies. Despite these inherent difficulties, the heavy computational requirements of many large-scale MD simulations justify the effort invested in attempting to tailor the MD algorithm to the hardware.

There have been a number of MD algorithms of this kind described in the literature [1,2,9]. Certain common features can be identified, but the details are often specific to both the hardware and the nature of the MD application. The goal of all these endeavors is performance, but memory requirements must be taken into account since there is often a tradeoff between storage and speed: the amount of computation necessary can often be reduced by reorganizing the data, an action that consumes additional storage. The widely used neighbor list technique [4] can easily double the storage requirements of an MD calculation, while the use of layers [10] as a means of achieving efficient vector processing also imposes a heavy storage penalty. For large systems available storage may prevent use of these additional bookkeeping methods, resulting in a more time-consuming computation.

2. Parallel computing

2.1. Parallelism in general

The two most widely used approaches to parallel processing are based on parallelism at either data or execution levels: SIMD and MIMD, respectively. In practice, MIMD machines are built from sets of high-speed processing elements, each with its own private memory, which communicate over a fast network. Each processor executes its own program, although these will often be copies of the same program; processors will generally operate on different quantities of data, with synchronization being required whenever data is to be exchanged. SIMD machines, on the other hand, normally involve much larger numbers of simpler processing elements (typically 1-bit processors) that operate in lock-step, each having a small private memory and able to decide whether to execute a particular instruction. Communication is again over a network, but synchronization is not usually an issue because it is implicit in the way the computations are organized.

In both multi-processor architectures the communication networks provide direct connectivity between nearest neighbor processors on a grid whose topology may range from a two-dimensional net to a maximally dimensioned hypercube. Data transfer between non-adjacent processors requires routing over one or more intermediate steps, and conflicting message paths can cause delays. The time consumed by interprocessor data transfers depends on the details of the problem and the degree to which it has been subdivided; in extreme cases communication can even dominate (a situation not easily justified). Shared memory is avoided; this feature tends to be confined to expensive "mainframe" supercomputers built from only a small number of processors, where the task of ensuring orderly access to a common memory is a manageable one.

The overall effectiveness, and in many computational environments cost-effectiveness, is strongly dependent on the nature of the computation and the computer architecture. Certain machines are definitely preferable for certain types of calculation. MD simulation (for short-ranged interactions) involves operations that are best carried out on the simplest of (fast) serial computers, or on MIMD machines with equally simple processors, but in view of the fact that considerable amounts of MD work are now carried out on vector and SIMD computers it is desirable that these machines also be used effectively. Vector and MIMD machines were discussed in the earlier papers of this series; the focus here is on the SIMD environment.

2.2. Architecture of the CM2

2.2.1. Processor array

A full-size CM2 Connection Machine was originally viewed as a computer consisting of 64K 1-bit processors. With the subsequent addition of floating-point hardware and later releases of the microcode and operating system it can now also be described as a 2K 32-bit processor machine; the two ways of regarding the machine are known as "fieldwise" and "slicewise" modes respectively, with the latter proving decidedly more convenient for problems involving large amounts of floating-point computation. In a typical slicewise configuration each processor has 1 Mbyte of private memory.

Overall control of the processor array is the responsibility of a microsequencer, which itself is under the control of a front-end computer, typically a workstation. The CM2 can be subdivided into as many as four smaller independent machines, each controlled by its own microsequencer and executing a separate program. There is a plethora of architectural details, many of which are relevant to the present work, and substantial performance gains can be achieved if they are taken into account during algorithm design and coding. Further information on the CM2 can be found in the manufacturer's publications such as refs. [11-13].

2.2.2. Interprocessor communication

Communication between processors is over a hypercube network: hypercube neighbors communicate most rapidly, but any communication pattern can be supported by the general purpose router, albeit with reduced bandwidth. In the former case, known as NEWS communication, all processors can simultaneously transfer data without interference. In the latter, communication can require several steps, and there is a distinct possibility of collisions that further reduce the transfer rate.

While the more visible communication transactions are those taking place among processing elements, a less apparent form of communication is typified by messages between the front-end computer and the microsequencer. These include sets of CM2 instructions and addresses needed to execute small segments of the program. A substantial amount of time can be spent in these operations, and if the front-end is heavily loaded with other tasks the CM2 can idle while waiting for the next batch of commands.

Taking parallelism into account calls for a detailed examination of the communications required to support the actual computation. In some instances, optimizing communication may require additional computations, but since it is only the overall performance that is at issue, and not how it is distributed between computation and communication, this is perfectly acceptable. A consequence of this is that an even wider range of computational schemes may have to be compared than for a simple uniprocessor system to establish which is the most effective.

2.2.3. Programming

Vector, matrix, and more general array quantities are mapped onto the processors as uniformly as possible. Whenever the array size exceeds the number of processors available, multiple array elements are automatically mapped to distinct storage locations in each processor. On a more abstract level the machine can be regarded as assigning a single virtual processor per array element. Arrays that have the same rank and index ranges will have corresponding elements mapped to identical processors, so that operations involving these elements will not entail any interprocessor communication. The floating-point units incorporate short vector pipelines, and array operations in which there are several members of each array per processor are vectorized. Although the user may choose to remain blissfully unaware of these and related issues, a certain appreciation of the way the hardware operates can lead to enhanced performance.

CM2 programming uses standard languages such as Fortran and C, with support for the unique hardware capabilities being provided by language extensions (including the numerous array operations that are part of Fortran 90), non-standard compiler directives to help organize storage, and special subroutines that both provide access to system services and help overcome compiler shortcomings. Taking the architecture into account can vary from as simple an act as setting array sizes to values that the machine handles efficiently (typically powers of two), through attempts to minimize communication and, where communication is unavoidable, attempting to confine data transfers to adjacent processing elements, and in extreme cases even using microcode (currently an undocumented feature) to optimize performance. Details of the many factors which must be considered during code optimization, together with information on the capabilities and failings of the compiler in this respect, will be found in ref. [13].

The present study employed Fortran. The CM Fortran compiler is still evolving, and at the time this work was carried out it was unable to effectively handle even some relatively common language constructs. An execution-time profile can be used to obtain details of where the program spends most of its time, allowing some of the more critical areas to be de-emphasized, either by use of alternative language constructs, or with the aid of special subroutines designed to bypass such problems. While there is little of lasting intellectual value in such details, they are mentioned because problems of a similar type appear in many CM2 applications.

3. Previous work

3.1. Problem specification

The problem used in previous performance assessment of the different CM2 algorithms is a 3D Lennard-Jones fluid with a cutoff in the range 2.3-2.5σ (where σ is the characteristic particle "diameter" that will subsequently be set to unity). For many large-scale MD studies a purely repulsive version of the potential with an even smaller cutoff (2^{1/6}σ) is all that is required; the smaller the cutoff the fewer the interactions requiring evaluation. The shorter range alternative provides an even more severe performance test for the CM2 since the distribution of effort between actual force computation and the work needed to rearrange data into an efficient format shifts towards the latter. The present paper deals with the 2^{1/6}σ case, but the effect of cutoff on performance can be estimated. It will become apparent that there are more serious impediments to detailed comparison with previous work, since different programming languages, hardware microcoding, and alternate ways of viewing the CM2 organization are involved.
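For reference, the purely repulsive form referred to here is usually taken to be the Lennard-Jones potential truncated at its minimum and shifted so that it vanishes at the cutoff; the shift is a common convention and is not stated explicitly in the paper. In reduced units (σ = ε = 1),

\[ u(r) = 4\,(r^{-12} - r^{-6}) + 1, \quad r < 2^{1/6}; \qquad u(r) = 0, \quad r \geq 2^{1/6}. \]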

3.2. Fine-grained cell subdivision

This approach is described in ref. [14] and is based on subdividing the spatial region occupied by the system into a three-dimensional cell grid. The way in which cell size affects performance was one of the aspects investigated, and of the possible choices, ranging from substantially fewer cells than particles to substantially more, the best performance was produced using approximately one cell per particle.

This calculation was carried out in fieldwise mode. Two mappings of the data to the virtual processor array are employed. The first is for general particle representation, with each particle being the sole occupant of a virtual processor. The second, used only during the interaction calculation, associates a virtual processor with each cell; the cell itself may be singly or multiply occupied, or empty. There is no systematic relationship between these two representations, so that transfers between them, which must be carried out twice during each timestep (to insert coordinates and to retrieve interactions), involve totally arbitrary communication patterns. The use of dual data representations also increases the storage requirements.

The programming language was CM *Lisp, and the tests considered systems with up to 512K particles. A technique for increasing processing speed by reducing the occupancy of the maximally occupied cells by unity was described; since computational effort is proportional to the square of the maximum occupancy, allowing the front-end computer to be responsible for pairing the final particle in the maximally occupied cells reduces the work. The actual performance, which scales approximately linearly with system size, will be discussed later.

This approach resembles the layer method introduced for vector processing [10]. The occupants of virtual processors in the cell representation are processed in an identical manner to the cell occupants in each layer. The method used to insert particles in the cells is also identical in the way multiple occupancy is handled: cell assignments are made but can then be overwritten by a subsequent assignment. The similarity is hardly surprising given the related data dependency requirements of vector and data-parallel hardware.

3.3. Coarse-grained cell subdivision

An alternative approach to the CM2 implementation of MD is described in ref. [15]. It is very similar in concept to the earlier method, but more coarse-grained, in that the cell size is determined by the interaction cutoff rather than based on unit average cell occupancy. With the enlarged cells, particles must be in the same or in neighbor cells to interact, and so the number of cell pairings is reduced. The alternative slicewise programming mode was adopted.

A number of specialized programming techniques were introduced to achieve a high level of performance, not all of which are presently available to the general user. The first of these is microcoding, which provides much greater control over the hardware; this ensures that the floating-point processors are used optimally, and careful register allocation results in fewer memory accesses. This approach cuts the interaction computation time by a factor of three. The second technique is the use of overlapped multiple data transfers in different hypercube directions; by using communication "stencils" the number of data transfers needed in 3D to access the 26 neighboring cells is reduced from 26 to just five. While the actual performance figures (the largest system contained only 18K particles) will be discussed later, it is obvious that these "in-house" methods account for the gain in performance.

Although ref. [14] showed that unit cell occupancy is close to optimal, the specialized techniques used in ref. [15] could tilt the balance in favor of larger cells, but this point was not explored. The former method requires more communication to transfer data among a larger number of cell pairs, whereas the latter does not use Newton's third law, preferring the extra computation to an additional series of data transfers to restore interaction data to the particle representation. An implementation based on neighbor lists was also discussed, but the performance, though better than fine-grained cells, fails to match the proposed coarse-grained method.


4. Algorithm

4.1. Overall organization

There are several basic components to any MD algorithm: (a) preparation of the initial state, (b) interaction calculation, (c) integration of the equations of motion, (d) measurements of physical properties of interest, and (e) miscellaneous administrative tasks such as checkpointing. It is the interactions that are practically always the major computational task, and it is here that optimization is important. Of the remaining parts of the calculation, initialization is carried out only once, integration requires a single processing loop that compilers readily optimize, and measurements, some of which may involve substantial computation, are normally sufficiently infrequent not to cause concern.

The approach adopted in this paper is based on methods introduced earlier for vector and MIMD processing. The system is first divided into cells to allow easy identification of potentially interacting neighbors, an approach common to almost all MD algorithms. In a vector processing environment the cell data is subsequently arranged into a layer form that is amenable to vectorization in several ways, depending on hardware [1,10]. For SIMD machines the layer organization will not merely be used as a temporary bookkeeping scheme, but rather as a permanent framework for organizing the data describing the state of the system; each (virtual) processor is responsible for the entire contents of a single cell for the duration of the simulation. This feature represents the principal difference between the present method and those published previously. That vector and SIMD architectures utilize similar forms of data organization is understandable since both predicate independent processing of array elements, either sequentially within vector pipelines or simultaneously on parallel processors. The overlap with the MIMD approach is to be found in the tasks needed to ensure that particles are always allocated to the correct processors.

An outline of the method is as follows; a more detailed description of the key features appears later in this section. At the beginning of the run the particles are assigned to cells, and hence also to processors, on the basis of their coordinates. The first particle that is assigned to any given cell is placed in the first layer, the second in the next layer, and so on. Cells will typically contain different numbers of particles, and there must be an upper bound to the occupancy that cannot be exceeded; a record is kept of how many particles occupy each cell. Each particle storage location includes space not only for the dynamical variables but also for the serial number assigned to each particle (its "identity"); locations that are unoccupied have this quantity set to zero.

An individual timestep entails the following sequence of operations: (a) compute interactions of all particle pairs within each cell; (b) for each layer with non-zero occupancy, and for each possible offset between neighboring cells, copy the coordinates and compute the interactions of any particles present with those originally in the cell; (c) integrate; (d) for each spatial dimension determine which particles are no longer in the correct cells and move the associated data to the appropriate cell; (e) compress the data in each cell to eliminate holes left by departing particles (this is a simpler alternative to using linked lists to manage the cell data).
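As an orientation aid, the following is a minimal sketch of how these steps might be arranged in a driver routine. The subroutine names are placeholders introduced here for illustration, not names used in the paper; the bodies of the individual steps correspond to the code fragments given in section 4.3.

      ! sketch of a single timestep; the called routines are hypothetical
      ! placeholders for the code fragments described later in this section
      subroutine single_step (n_dim)
         integer n_dim                       ! spatial dimension (2 or 3)
         integer k
         call in_cell_interactions           ! step (a): pairs within each cell
         call neighbor_cell_interactions     ! step (b): pairs in adjacent cells
         call integrate_motion               ! step (c): advance the trajectories
         do k = 0, 2*n_dim - 1               ! steps (d) and (e): one pass per
            call move_particles (k)          !   boundary direction, followed by
            call compress_cells              !   compression of the cell contents
         enddo
      end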

4.2. Fortran notation

CM Fortran conforms to the array-processing extensions of Fortran 90. Subject to the syntax rules, entire arrays can be processed with a single command; contrast this with most existing versions of Fortran that required explicit loops over the array elements. Array sections can be processed with similar ease. Adopting these new (for Fortran) features not only results in more concise and readable software, but also helps reveal instances of parallelism inherent in the algorithm that might otherwise go undetected.


To help use the hardware effectively, CM Fortran incorporates an extension that distinguishes between those array dimensions that are spread across processors and therefore treated in parallel (using NEWS communication where necessary), and those that are serially stored in the memory of each processor and treated sequentially. In the case of MD, each quantity (e.g. a coordinate) is represented by a multi-dimensional array: the array dimensions used for cell indexing and which therefore identify particular (virtual) processors are spread across the processor array, but the dimension that specifies the layers in which particles are placed is mapped serially onto each processor. Thus, unlike MD code for "conventional" computers where one-dimensional arrays suffice for all particle variables, CM2 code uses arrays of rank one greater than the spatial dimension of the system.

4.3. Description

4.3.1. Array definitions

To ensure that complete parallelism is achieved it is important to be aware of the way in which arrays used in the calculation are mapped onto the processor set. Two classes of arrays are introduced explicitly, those containing permanent data describing the state of the system, and those holding temporary results. The compiler may also generate its own temporary arrays (although in the interest of efficiency there ought to be as few of these as possible), but if the code is properly organized these arrays will also be processed in parallel.

The array dimensions are determined by the size of the cell grid, grid_x (etc.), and the maximum permitted cell occupancy (i.e. the number of available layers), max_oc; space must also be reserved for particle data copied from the neighbor cells. Valid array operations can only involve arrays or array sections with the same rank and size; in the interest of readability, placeholders for array indices that stand for entire array sections (denoted by colons in Fortran 90 syntax) have been replaced by ellipses in those arrays where some of the indices must appear explicitly. The arrays are as follows (particle velocities are also needed but are not mentioned in this paper):

   rx(max_oc, grid_x, ...)   coordinate, also ry, ...
   ax(max_oc, grid_x, ...)   acceleration, also ay, ...
   id(max_oc, grid_x, ...)   particle identity
   oc_count(grid_x, ...)     cell occupancy
   pick(grid_x, ...)         particle is selected
   dx(grid_x, ...)           inter-particle separation, also dy, ...
   dd(grid_x, ...)           square of inter-particle distance
   temp(grid_x, ...)         temporary array
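To make the shapes concrete, a minimal declaration sketch in standard Fortran 90 for a 3D system is given below. The grid sizes and occupancy bound are illustrative values only, and reserving extra first-dimension extent for the 26 imported neighbor layers is just one way of providing the space mentioned above; on the CM2 itself the serial/spread distinction described in section 4.2 would additionally be expressed with a compiler layout directive.

      module md_arrays
         implicit none
         ! illustrative sizes only; a production run would use much larger grids
         integer, parameter :: grid_x = 64, grid_y = 64, grid_z = 64
         integer, parameter :: max_oc = 8      ! bound on resident layers per cell
         integer, parameter :: n_off = 26      ! neighbor-cell offsets in 3D
         ! layer (serial) dimension first, cell-grid dimensions after it;
         ! room is included for layers imported from the neighbor cells
         real,    dimension(max_oc+n_off, grid_x, grid_y, grid_z) :: rx, ry, rz
         real,    dimension(max_oc+n_off, grid_x, grid_y, grid_z) :: ax, ay, az
         integer, dimension(max_oc+n_off, grid_x, grid_y, grid_z) :: id
         integer, dimension(grid_x, grid_y, grid_z) :: oc_count   ! particles per cell
         logical, dimension(grid_x, grid_y, grid_z) :: pick       ! selection mask
         real,    dimension(grid_x, grid_y, grid_z) :: dx, dy, dz, dd, temp
      end module md_arrays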

4.3.2. Interaction computations

The interactions are computed in two distinct stages, which, though having much in common, are separated for clarity and convenience. In the first part the interactions between particles belonging to the same cell are computed, an operation not requiring any communication. The second part involves particles in neighboring cells and incorporates the necessary communication to place particle data where it is needed. The force between particles, expressed in conventional MD reduced units [4], has the form

\[ \mathbf{F}_{ij} = 48\,\bigl(r_{ij}^{-14} - 0.5\, r_{ij}^{-8}\bigr)\,\mathbf{r}_{ij}, \qquad |\mathbf{r}_{ij}| < 2^{1/6}, \]
\[ \mathbf{F}_{ij} = \mathbf{0}, \qquad |\mathbf{r}_{ij}| \geq 2^{1/6}. \]

The in-cell interaction computation is shown below. The where instruction is the equivalent of a parallel if; the instructions it governs are only applied to array elements (or equivalently, are performed in those virtual processors) with indices for which the condition is satisfied. The instantaneous maximum cell occupancy is given by hi_oc, and huge is a sufficiently large number used for distinguishing between particle pairs lying inside and outside the interaction cutoff r_cut (= 2^{1/6}). All accelerations are initially set to zero. Note that just the x-components of the computations are shown explicitly, and that a few minor stylistic simplifications have been made to the Fortran.

      ! loop over all pairs of occupied layers; each array operation treats
      ! every cell (virtual processor) simultaneously
      do i2 = 1, hi_oc - 1
         do i1 = i2 + 1, hi_oc
            dd = huge
            where (id(i1,...) > 0 and id(i2,...) > 0)
               dx = rx(i1,...) - rx(i2,...)
               dd = dx**2 + ...
            endwhere
            where (dd < r_cut**2)
               temp = 1 / dd**3
               temp = 48 * temp * (temp - 0.5) / dd
               ax(i1,...) = ax(i1,...) + temp * dx
               ax(i2,...) = ax(i2,...) - temp * dx
            endwhere
         enddo
      enddo

The interactions between particles in neighboring cells are evaluated as described below. Following the recommended practice, communication is kept apart from computation: coordinates are first imported from all neighboring cells and only then is the computation performed. Because of the periodic boundaries the transfers are based on a series of circular shifts, one for each spatial dimension; for clarity the shifts are shown in separate statements. The three arguments of the cshift (circular shift) subroutine specify the array name, the array dimension along which the shift is required, and the direction (with a positive value denoting a shift in the direction of decreasing index). Such NEWS transfers are the ones the CM2 performs best. The contents (if any) of a total of 26 neighbor cells are imported for 3D systems (8 in 2D).

The loops are arranged to consider each layer of the neighbor cells separately, but all directions are treated together; the approach is by no means unique. The circular shifts automatically handle cell wraparound associated with periodic boundaries, although the coordinates of the particles involved must be adjusted in a separate step. Only the interactions affecting particles originally in the cells are accumulated; those for the imported particles are treated during the shift in the opposite direction. Variables introduced here include copy, which indicates the neighbor cell offset in each of the spatial dimensions (the values are 0, ±1), and the length of the container len_x; the function mod is the standard modulo function. As before, only the x-components of the computations and data transfers are shown explicitly.

      do i3 = 1, hi_oc
         i = 0
         do dir = 0, 26
            ! decode the neighbor-cell offset (-1, 0, +1) in each dimension
            copy(1) = mod(dir,3) - 1
            copy(2) = mod(dir,9)/3 - 1
            copy(3) = dir/9 - 1
            if (copy(1) <> 0 or copy(2) <> 0 or copy(3) <> 0) then
               i = i + 1
               ! import the coordinates of layer i3 from the neighbor cell
               temp = rx(i3,...)
               temp = cshift(temp, 1, copy(1))
               temp = cshift(temp, 2, copy(2))
               temp = cshift(temp, 3, copy(3))
               ! adjust coordinates wrapped across the periodic boundary
               if (copy(1) < 0) temp(1,...) = temp(1,...) - len_x
               if (copy(1) > 0) temp(grid_x,...) = temp(grid_x,...) + len_x
               rx(hi_oc+i,...) = temp
               ! import the corresponding particle identities
               temp = id(i3,...)
               temp = cshift(temp, 1, copy(1))
               temp = cshift(temp, 2, copy(2))
               id(hi_oc+i,...) = cshift(temp, 3, copy(3))
            endif
         enddo
         ! interactions between resident layers and the imported data
         do i2 = hi_oc+1, hi_oc+26
            do i1 = 1, hi_oc
               dd = huge
               where (id(i1,...) > 0 and id(i2,...) > 0)
                  dx = rx(i1,...) - rx(i2,...)
                  dd = dx**2 + ...
               endwhere
               where (dd < r_cut**2)
                  temp = 1 / dd**3
                  temp = 48 * temp * (temp - 0.5) / dd
                  ax(i1,...) = ax(i1,...) + temp * dx
               endwhere
            enddo
         enddo
      enddo

The 2D version of the code is very similar.

      do i3 = 1, hi_oc
         i = 0
         do dir = 0, 8
            copy(1) = mod(dir,3) - 1
            copy(2) = dir/3 - 1
            if (copy(1) <> 0 or copy(2) <> 0) then
               i = i + 1
               temp = rx(i3,...)
               temp = cshift(temp, 1, copy(1))
               temp = cshift(temp, 2, copy(2))
               if (copy(1) < 0) temp(1,...) = temp(1,...) - len_x
               if (copy(1) > 0) temp(grid_x,...) = temp(grid_x,...) + len_x
               rx(hi_oc+i,...) = temp
            endif
         enddo
         do i2 = hi_oc+1, hi_oc+8
            ! interaction computation as in the 3D case
         enddo
      enddo


If a quantity such as the internal energy is required it can be evaluated at the same time as the interactions; the only point worth noting is that the potential energy of each particle must be evaluated individually, and the resulting values summed (a reduction operation).
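As an illustration only, the per-cell energy accumulation might be grafted onto the existing cutoff test roughly as follows; the work array pe, the scalar total_pe, and the truncated, shifted form of the potential are assumptions not spelled out in the paper.

      ! sketch: pe is an assumed grid-shaped work array, zeroed at the start of
      ! each timestep; the pair energy is the truncated, shifted LJ form in
      ! reduced units, accumulated where the interaction is also computed (pairs
      ! involving neighbor cells would need the same treatment, counted once)
      where (dd < r_cut**2)
         pe = pe + 4 * (1/dd**6 - 1/dd**3) + 1
      endwhere
      ! ... after all interaction stages:
      total_pe = sum(pe)           ! global reduction over the processor array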

4.3.3. Cell updates

With the forces on all particles evaluated, integration of the equations of motion is a trivially parallel task. As a result of the coordinate changes some particles will no longer be in the correct cells, and a few may require their coordinates to be modified because one or more periodic boundaries have been crossed. There are four (in 2D) or six (in 3D) possible directions in which such crossings can occur, and since a particle crossing near an edge or corner may require wraparound of more than one of its coordinates, the spatial dimensions are treated successively; during this procedure a particle can temporarily reside in a cell that is not the final destination.
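To illustrate the integration step mentioned above: with whole-array operations a leapfrog update, for example, reduces to a pair of statements per component. The velocity arrays vx, vy, ... and the timestep dt are assumed names (the paper does not show this code), and values in unoccupied storage locations are harmless since their identity entries are zero.

      ! sketch of a leapfrog step using whole-array operations (assumed names)
      vx = vx + dt * ax            ! half-step-offset velocity update
      rx = rx + dt * vx            ! coordinate update
      ! ... and similarly for the y (and z) components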

The outermost loop of the code (the 3D version) that relocates particles no longer in their correct cells is as follows. New variables used here are the spatial dimension of the move, dir, and its sign, move.

      do k = 0, 5
         dir = k/2 + 1                ! spatial dimension of the move
         move = mod(k,2)*2 - 1        ! sign of the move (-1 or +1)
         do i = 1, hi_oc
            {find out-of-cell particles in this layer and move them}
         enddo
         {compress particles stored in cells}
      enddo

The task of determining which particles have coordinates that no longer lie within the cell limits (considering just one direction at a time) and then moving them, is carried out in two stages. The first locates the particles affected, marks them with the Boolean variable pick, and modifies the coordinates wherever a periodic boundary is crossed. The code includes loops of the form do nx = ... which use the loop index itself in the computations; to force these loops into parallel form, the do operation must be replaced by the special CM Fortran forall (nx = ...) construct which the compiler recognizes as a loop which should be executed in parallel. The quantity cell_x (= len_x / grid_x) is the length of the cell edge.

      if (dir == 1) then
         if (move < 0) then
            ! mark particles that have moved beyond the upper edge of their cell
            do nx = 1, grid_x
               pick(nx,...) = id(i,nx,...) > 0 and rx(i,nx,...) > nx*cell_x
            enddo
            where (pick(grid_x,...)) rx(i,grid_x,...) = rx(i,grid_x,...) - len_x
         else
            ! mark particles that have moved below the lower edge of their cell
            do nx = 1, grid_x
               pick(nx,...) = id(i,nx,...) > 0 and rx(i,nx,...) < (nx-1)*cell_x
            enddo
            where (pick(1,...)) rx(i,1,...) = rx(i,1,...) + len_x
         endif
      elseif (dir == 2) then
         ...
      endif


The second stage is the move itself. This involves a circular shift of the array pick in the appropriate direction, incrementing oc_count for the cells that are to receive a new particle, transferring the data items (coordinates, etc.) to temporary holding storage and, finally, storing them, but only if they correspond to a particle that actually enters the cell. A test to ensure that the maximum cell occupancy is not exceeded is not shown. After all quantities have been transferred the array pick is returned to its original location and used to mark the locations that have been vacated. The communication used for the inter-cell moves is entirely local in nature. The subroutine cmf_aset_1d is necessary to overcome compiler myopia in recognizing parallel intra-processor data transfers (one example is indirect addressing along the serial dimension of an array). Here it is used to selectively transfer data from holding storage to the proper destination, with pick determining whether the transfer is to be carried out; the other arguments specify arrays that are, in order, the destination, source and destination offset. The array temp is necessary to overcome syntax restrictions on the use of array sections in certain parallel operations. Though not shown here, all quantities associated with the particles must be transferred.

      pick = cshift(pick, dir, move)
      where (pick) oc_count = oc_count + 1
      temp = rx(i,...)
      temp = cshift(temp, dir, move)
      call cmf_aset_1d(rx, temp, oc_count, pick)   ! store where pick is true
      pick = cshift(pick, dir, -move)
      where (pick) id(i,...) = 0                   ! mark the vacated locations

The final code segment deals with compressing stored particle data to eliminate the holes left by departing particles; this operation is repeated following each move direction. Here cmf_aset_1d is used to help compress arrays with possible vacancies along the first (serial) array dimension, an operation intended to reduce the active layer count, with pick determining which positions hold valid particle data. The function maxval computes the maximum of an entire array in a highly efficient manner. All particle data arrays are similarly treated within the loop.

      hi_oc = maxval(oc_count)
      oc_count = 0
      do i = 1, hi_oc
         pick = id(i,...) > 0
         where (pick) oc_count = oc_count + 1
         temp = rx(i,...)
         call cmf_aset_1d(rx, temp, oc_count, pick)   ! pack valid entries downward
         where (oc_count < i) id(i,...) = 0           ! clear the slots left empty
      enddo
      hi_oc = maxval(oc_count)

The need for temporary storage and especially the use of the cmf_aset_1d subroutine make the code harder to follow than would otherwise be the case. Alternative, more intelligible methods of achieving the same result may still execute in parallel, but as a consequence of the compiler myopia alluded to above, will use general communication operations to transfer data between locations within each processor. The present code, though somewhat awkward on account of these limitations, is executed entirely in parallel.

4.3.4. Initialization

Preparation of the starting state is a task performed only once and therefore does not need to be optimized. The region is filled by placing particles at the sites of a regular lattice; particles are assigned velocities that are random in direction and whose magnitudes correspond to the desired temperature (some subsequent temperature adjustment may be required). The way in which initialization on the CM2 differs from a uniprocessor implementation is the need to ensure that all particles are assigned to the correct processors and that oc_count is initialized.
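A minimal sketch of the velocity assignment for a 2D system is shown below, assuming velocity arrays vx, vy, a work array theta and a reduced temperature variable; none of these names appear in the paper, and the zero-momentum adjustment mentioned in the final comment is standard practice rather than something stated here.

      ! sketch only: fixed speed, random direction, for a 2D system in reduced units
      vmag = sqrt(2. * temperature)      ! from <v**2> = 2T per particle (d = 2)
      call random_number(theta)          ! theta: real array shaped like vx
      theta = 2. * 3.14159265 * theta
      vx = vmag * cos(theta)
      vy = vmag * sin(theta)
      ! the velocities would then normally be shifted so the total momentum vanishes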

4.4. Features omitted

Several ways of enhancing performance were not explored, but should receive attention prior to any extended series of computations. The most obvious improvement would be to incorporate Newton's third law and halve the number of interacting pairs. This would approximately double the computation rate and, although not increasing the total communication effort, would require a separate sequence of transfers to return the computed interactions to the original cells. The overall impact on 3D performance (see the following section) would not be great unless communication is also optimized. Use of the front-end processor to reduce layer pairings [14] might also prove worthwhile if the front-end is sufficiently responsive.

Further optimization calls for a more detailed understanding of the CM2. Use of overlapped communication, as described in ref. [15], has the potential for eliminating almost half the total work in 3D, but requires CM2 stencils. An alternative polyshift subroutine is available that could produce similar performance gains since it facilitates overlapped communication and reduced setup overheads; use of this capability requires the same detailed planning of data transfers needed for stencils. Additional performance gains could be achieved by microcoding the interaction calculations.

5. Performance

5.1. Measurements

Performance measurements were carried out for several system sizes in two and three dimensions. Most measurements were made using a quarter CM2, namely 512 processors, but the largest runs used 1024 processors to obtain the needed storage. The cell array sizes were chosen to produce unit mean cell occupancy.

The processing times, in units of μs per particle-timestep, are shown in table 1. The fraction of the machine used is also shown. The choice of system sizes stems from the use of a square array of initial particle positions in 2D and an FCC array in 3D; densities are 0.5 in 2D and 0.71 in 3D.

A breakdown of the relative computation and communication times according to the tasks is shown in table 2; these results were derived from run-time performance analysis using the CM Prism utility. An alternative breakdown by CM2 operation type appears in table 3.

Table 1
Processing time per particle (in μs) for different system sizes (N) in two and three dimensions; the fraction of the CM2 used is indicated.

Dimension   CM2   N         Time
2D          1/4     65536    9.8
            1/4    147456    9.9
            1/2   2359296    5.7
3D          1/4     42592   38.2
            1/4    143748   39.1
            1/2   1149984   23.2


Table 2
Fraction of processing time used in the principal tasks.

Task                    2D     3D
copy neighbors          0.22   0.46
neighbor interactions   0.50   0.38
move particles          0.22   0.13
compress data           0.06   0.03

Table 3
Relative processing time according to operation type.

Operation                   2D     3D
computation                 0.70   0.52
communication - NEWS        0.22   0.42
communication - reduction   0.04   0.04
communication - front-end   0.04   0.02

Several points emerge from these results. Foremost is the substantial fraction of the time devoted to communication, particularly in 3D. Neither the interactions between particles within the same cell nor the integration of the equations of motion appear in table 2 since they account for less than 1% of the work. The fact that doubling the machine size does not exactly halve the processing time is probably due to more subtle hardware characteristics. NEWS communication (copying coordinates and moving particles) accounts for practically all the non-computational time. Of the other two kinds of communication, reduction is used, for example, in evaluating array maxima, and the small amount of front-end communication supports machine operation.

5.2. Comparison

The performance of the present algorithm for the 3D case can be compared with earlier measurements on the CM2 if allowance is made for the larger interaction range used there. With hardware of this complexity, where a variety of factors (some obvious, others less so) contribute both positively and negatively to performance, such derived estimates may not always be entirely reliable. Other differences, namely whether fieldwise or slicewise mode is used, programming language, and the use of microcode and communication stencils, cannot be quantified without detailed measurement. The results appear in table 4; the times have been adjusted to the interaction cutoff used in the present work and normalized to a full CM2 (2K 32-bit processors).

To put these figures into perspective the discussions earlier in the paper should be recalled. (a) The fine-grained cell approach can be speeded up by about 30% if the front-end computer assists with the maximally occupied cells.

Table 4
Normalized estimates of time (in μs) per particle-step for different methods.

Method                   Time
"fine-grained" cells     10
"coarse-grained" cells   2
layers                   10


(b) The coarse-grained method gains a factor of at least four by using low-level programming. (c) The layer method would run about 20% faster if only half the interactions were computed, and possibly faster still if the front-end computer was used as in (a); it also has the advantage of using only local communication.

The CM2 performance can also be compared with measurements for the same MD problem on other machines. Performance figures for vector and MIMD machines were given in refs. [1,2]. Of particular interest is the speed of a typical superscalar workstation, machines that require no special programming effort to achieve a very respectable level of performance. The time per particle-step for an IBM 6000/320, using neighbor lists and tabulated interactions, is 15 μs in 2D and 33 μs in 3D. Clearly, for this particular class of computation, it takes only a few of these machines to reach the computing power of a full CM2 or, for that matter, a large vector mainframe computer such as the Cray YMP.

The performance is best interpreted in a price-performance context, where it is obvious from a cost comparison that the superscalar workstation is ahead by an order of magnitude or more. The reasons for this are the following. The vectorized implementation entails substantial overheads; for small systems neighbor lists can be used to improve performance, but the heavy memory penalty prohibits this for large systems. MIMD performance (e.g. the Intel iPSC/860) is similar to the workstation (neighbor lists were not used in the MIMD study, although they could have been), which is to be expected given that the processors are roughly equivalent; there is a difference in cost-effectiveness because of the higher price of MIMD machines with a significant investment in communication resources. Even the superscalar workstation is not a perfect MD platform because of a mismatch: much of its performance is achieved through the use of high-speed cache storage, and the nature of MD is such that cache effectiveness is significantly reduced because of the apparently random manner in which memory accesses occur.

6. Discussion

6.1. Features of the algorithm

The advantage of the method introduced in this paper is its straightforwardness. All data associated with a particular particle is contained in the processor where the particle resides, and when the particle moves it carries the data along. In this respect the method resembles the MIMD approach, the principal difference being the number of particles per processor that the algorithm is designed to accommodate: here the expected number is of order unity, whereas the MIMD approach caters for relatively large numbers.

The most conspicuous weakness of all methods that are based on a subdivision into cells occupied by very small numbers of particles is the acute sensitivity to density variations. In the present work these variations are confined to local fluctuations in cell occupancy, with the amount of computation being determined by the maximally occupied cell. This is identical to the situation encountered in the layer-based vectorized approach. The weakness is inherent in the algorithm design, and because of this the algorithm performs best with homogeneous systems at comparatively high density.

6.2. Extensibility

There is nothing in principle preventing the algorithm being used for more complex MD studies. Small rigid molecules represent a trivial extension of the method; all that is required is a cell size exceeding the maximum interaction range measured between molecule centers of mass. Multiple species can also be handled, either by extending the role of the particle identity or by introducing additional descriptors; then if the different interactions have similar functional forms a suitable combination of particle identities can be used to reference the interaction parameters, permitting all interaction types to be evaluated simultaneously. Polymer chains in solution and polymer melts are also systems to which the present approach is readily adapted.
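To make the multiple-species remark concrete, a minimal sketch for a two-species mixture is shown below, written in the same simplified style as the paper's fragments. The species array (shaped like id), the grid-shaped work array eps, and the scalar well depths eps_AA and eps_AB are illustrative assumptions, and for simplicity the two species are taken to differ only in well depth.

      ! sketch: choose the well depth elementwise from the species of the two
      ! layers being paired, then scale the force in the existing cutoff test
      where (dd < r_cut**2)
         eps = merge(eps_AB, eps_AA, species(i1,...) <> species(i2,...))
         temp = 1 / dd**3
         temp = 48 * eps * temp * (temp - 0.5) / dd
         ax(i1,...) = ax(i1,...) + temp * dx
      endwhere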

6.3. Data vs. execution parallelism

Data parallelism has proved, at least until now, an extremely cost-effective way of achieving massive parallelism. Some problems adapt naturally to this environment and are easily programmed to use SIMD machines efficiently. Such problems are typically characterized by a simple grid-like structure with similar operations being applied to all grid elements, and by locality of reference, the need for grid elements to only know about their immediate neighbors. MD simulation does not naturally fall into this category, which explains the extensive data restructuring and the fact that performance falls far short of the machine capability displayed for more appropriate problems [16].

Advances in hardware are gradually eroding the benefits of SIMD architecture, and given the greater flexibility of the MIMD approach it is likely that the demise of the former is imminent. An MIMD machine is perfectly capable of emulating SIMD functionality when required. But for problems such as MD, where enforced conformation to SIMD constraints leads to serious inefficiency, MIMD is the preferred architecture. The MIMD implementation of MD has all the right properties: it is fully scalable, communication is confined to neighboring processors, and the amount of data that must be transferred grows less rapidly than the number of particles per processor.

Both SIMD and MIMD approaches to MD require that communication operations must be expressed explicitly in the program (circular shifts for SIMD, message passing for MIMD), but the greater flexibility of the latter, in particular that individual processors are free to handle data in different ways, contributes to a simpler algorithm design. MIMD parallelism is more coarse-grained, and this helps smooth out the effects of local density fluctuations that are so costly in the SIMD method. If large-scale inhomogeneities are present, however, even the MIMD approach will have to be augmented by some form of dynamic load balancing to avoid a situation analogous to that encountered in the fine-scale parallelism of SIMD; given the increasing use of MIMD hardware this issue will have to be addressed in the near future. As massive parallelism becomes available there is little question that it will be put to good use in large-scale MD simulation.

Acknowledgements

The author would like to thank the Supercomputer Computations Research Institute at Florida State University for its hospitality while this study was being carried out. The work was supported in part by the US Department of Energy through contract no. DE-FC05-85ER250000. Hagai Meirovitch and Paul Oppenheimer are thanked for helpful discussion.

References

[1] D.C. Rapaport, Comput. Phys. Commun. 62 (1991) 198.
[2] D.C. Rapaport, Comput. Phys. Commun. 62 (1991) 217.
[3] G. Ciccotti and W.G. Hoover, eds., Molecular Dynamics Simulation of Statistical Mechanical Systems (North-Holland, Amsterdam, 1986).
[4] M.P. Allen and D.J. Tildesley, Computer Simulation of Liquids (Oxford Univ. Press, Oxford, 1987).
[5] C.R. Catlow, S.C. Parker and M.P. Allen, eds., Computer Modelling of Fluids, Polymers and Solids (Kluwer, Dordrecht, 1990).
[6] F.F. Abraham, Adv. Phys. 35 (1986) 1.
[7] D.C. Rapaport, Phys. Rev. A 36 (1987) 3288, and to be published.


[8] W. Swope and H.C. Andersen, Phys. Rev. B 41 (1990) 7042.
[9] W. Smith, Comput. Phys. Commun. 62 (1991) 229.
[10] D.C. Rapaport, Comput. Phys. Rep. 9 (1988) 1.
[11] J. Bailey, Thinking Machines Tech. Report TR90-1 (1990).
[12] Getting Started in CM Fortran (Thinking Machines Corp., Cambridge, MA, 1991).
[13] CM Fortran Optimization Notes: Slicewise Model (Thinking Machines Corp., Cambridge, MA, 1991).
[14] A.I. Mel'cuk, R.C. Giles and H. Gould, Comput. Phys. 5 (1991) 311.
[15] P. Tamayo, J.P. Mesirov and B.M. Boghosian, Thinking Machines Tech. Report MD91-207 (1991).
[16] A.D. Kennedy, Int. J. Mod. Phys. 3 (1992) 1.