
Lecture 11: Distributed Memory Machines and Programming

CSCE569 Parallel Computing

Department of Computer Science and Engineering
Yonghong Yan
yanyh@cse.sc.edu
http://cse.sc.edu/~yanyh

Topics

• Introduction
• Programming on shared memory systems (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools

Acknowledgement

• Slides adapted from the U.C. Berkeley course CS267/EngC233 "Applications of Parallel Computers" by Jim Demmel and Katherine Yelick, Spring 2011
  – http://www.cs.berkeley.edu/~demmel/cs267_Spr11/
• And materials from various sources

Shared Memory Parallel Systems: Multicore and Multi-CPU


Node-level Architecture and Programming

• Shared memory multiprocessors: multicore, SMP, NUMA
  – Deep memory hierarchy; distant memory is much more expensive to access.
  – Machines scale to 10s or 100s of processors
  – Instruction Level Parallelism (ILP), Data Level Parallelism (DLP) and Thread Level Parallelism (TLP)
• Programming
  – OpenMP, PThreads, Cilk Plus, etc.

HPC Architectures (TOP500, Nov 2014)

Outline

• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives

Clusters

• A group of linked computers, working together closely so that in many respects they form a single computer.
• Consists of
  – Nodes (front + computing)
  – Network
  – Software: OS and middleware

[Diagram: Node ... Node connected by a High Speed Local Network, with Cluster Middleware layered on top]

Top 10 of Top500

http://www.top500.org/lists/2016/06/

(Ethernet, Infiniband, ...) + (MPI)

HPC Beowulf Cluster

• Master node: or service/front node (used to interact with users locally or remotely)
• Computing nodes: perform computations
• Interconnect and switch between nodes: e.g. G/10G-bit Ethernet, Infiniband
• Inter-node programming
  – MPI (Message Passing Interface) is the most commonly used.

Network Switch

Network Interface Card (NIC)

Outline

• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives

Network Analogy

• To have a large number of different transfers occurring at once, you need a large number of distinct wires
  – Not just a bus, as in shared memory
• Networks are like streets:
  – Link = street.
  – Switch = intersection.
  – Distances (hops) = number of blocks traveled.
  – Routing algorithm = travel plan.
• Properties:
  – Latency: how long it takes to get between nodes in the network.
  – Bandwidth: how much data can be moved per unit time.
    • Bandwidth is limited by the number of wires and the rate at which each wire can accept data.

Latency and Bandwidth

• Latency: time to travel from one location to another for a vehicle
  – Vehicle type (large or small messages)
  – Road/traffic condition, speed limit, etc.
• Bandwidth: how many cars, and how fast, can travel from one location to another
  – Number of lanes

Performance Properties of a Network: Latency

• Diameter: the maximum (over all pairs of nodes) of the shortest path between a given pair of nodes.
• Latency: delay between send and receive times
  – Latency tends to vary widely across architectures
  – Vendors often report hardware latencies (wire time)
  – Application programmers care about software latencies (user program to user program)
• Observations:
  – Latencies differ by 1-2 orders of magnitude across network designs
  – Software/hardware overhead at source/destination dominates cost (1s-10s of usecs)
  – Hardware latency varies with distance (10s-100s of nsec per hop) but is small compared to overheads
• Latency is key for programs with many small messages

1 second = 10^3 milliseconds (ms) = 10^6 microseconds (us) = 10^9 nanoseconds (ns)

Latency on Some Machines/Networks

[Bar chart: 8-byte roundtrip latency (usec), measured with an MPI ping-pong, for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; measured values of 14.6, 6.6, 22.1, 9.6, 18.5, and 24.2 usec on a 0-25 usec axis]

• Latencies shown are from a ping-pong test using MPI
• These are roundtrip numbers: many people use ½ of roundtrip time to approximate 1-way latency (which can't easily be measured)
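A minimal sketch of such a ping-pong test in MPI/C (the 8-byte message matches the chart above; the iteration count and reporting only from rank 0 are illustrative choices, not part of the original measurements):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 1000;          /* number of ping-pong exchanges */
        char buf[8] = {0};               /* 8-byte message, as in the chart */
        MPI_Status st;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {             /* rank 0: send, then wait for the echo */
                MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {      /* rank 1: echo the message back */
                MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* average roundtrip time in usec */
            printf("avg roundtrip = %g usec\n", (t1 - t0) / iters * 1e6);
        MPI_Finalize();
        return 0;
    }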

End-to-End Latency (1/2 roundtrip) Over Time

[Scatter plot: one-way latency in usec (log scale, 1 to 100) vs. year (roughly 1990 to 2010) for machines including nCube/2, CM5, CS2, SP1, SP2, Paragon, T3D, SPP, KSR, Cenju3, T3E, SP-Power3, Quadrics, and Myrinet; values range from about 2.6 usec to about 36 usec]

• Latency has not improved significantly, unlike Moore's Law
• T3E (shmem) was the lowest point, in 1997

Data from Kathy Yelick, UCB and NERSC

Performance Properties of a Network: Bandwidth

• The bandwidth of a link = #wires / time per bit
• Bandwidth is typically in Gigabytes/sec (GB/s), i.e., 8*2^30 bits per second
• Effective bandwidth is usually lower than the physical link bandwidth due to packet overhead.

[Packet format: routing and control header | data payload | error code | trailer]

• Bandwidth is important for applications with mostly large messages
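For example (the packet sizes here are illustrative, not taken from the slide): if every packet carries a 4096-byte payload plus 64 bytes of header, error code, and trailer, only 4096/4160 ≈ 98.5% of the bits on the wire are payload, so the effective bandwidth is about 98.5% of the physical link bandwidth; smaller payloads push this fraction down further.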

Bandwidth on Some Networks

[Bar chart: MPI flood bandwidth for 2 MB messages, shown as a percentage (0-100%) of hardware peak, for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; achieved bandwidths labeled on the bars range from roughly 225 to 1504 MB/s]

• Flood bandwidth (throughput of back-to-back 2 MB messages)

Bandwidth Chart

[Line chart: bandwidth (MB/sec, up to about 400) vs. message size (2048 to 131072 bytes) for T3E/MPI, T3E/Shmem, IBM/MPI, IBM/LAPI, Compaq/Put, Compaq/Get, M2K/MPI, M2K/GM, Dolphin/MPI, Giganet/VIPL, and SysKonnect]

Data from Mike Welcome, NERSC

Note: bandwidth depends on SW, not just HW

Performance Properties of a Network: Bisection Bandwidth

• Bisection bandwidth: bandwidth across the smallest cut that divides the network into two equal halves
• Bandwidth across the "narrowest" part of the network

[Figure: examples of a bisection cut and of a cut that is not a bisection cut; in the examples shown, bisection bw = link bw and bisection bw = sqrt(n) * link bw]

• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others

Other Characteristics of a Network

• Topology (how things are connected)
  – Crossbar, ring, 2-D and 3-D mesh or torus, hypercube, tree, butterfly, perfect shuffle, ...
• Routing algorithm:
  – Example in a 2D torus: all east-west, then all north-south (avoids deadlock), as sketched below.
• Switching strategy:
  – Circuit switching: full path reserved for the entire message, like the telephone.
  – Packet switching: message broken into separately-routed packets, like the post office.
• Flow control (what if there is congestion):
  – Stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.
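The east-west-then-north-south rule is an instance of dimension-ordered routing; below is a minimal sketch in C of that idea for a hypothetical X x Y torus with wraparound links (the type and function names are illustrative, not from any real machine):

    /* Dimension-ordered routing on an X x Y 2D torus: route fully in the
     * x (east-west) dimension first, then in y (north-south).
     * Returns the next hop for a packet at `cur` headed to `dst`. */
    typedef struct { int x, y; } coord_t;

    static int step_toward(int cur, int dst, int size) {
        if (cur == dst) return cur;
        int fwd = (dst - cur + size) % size;         /* hops going "up" with wraparound */
        return (fwd <= size - fwd) ? (cur + 1) % size          /* shorter to go up */
                                   : (cur - 1 + size) % size;  /* shorter to go down */
    }

    coord_t next_hop(coord_t cur, coord_t dst, int X, int Y) {
        coord_t nxt = cur;
        if (cur.x != dst.x)          /* finish the east-west dimension first */
            nxt.x = step_toward(cur.x, dst.x, X);
        else                         /* then route north-south */
            nxt.y = step_toward(cur.y, dst.y, Y);
        return nxt;
    }

Ordering the dimensions this way is what removes the cyclic channel dependencies that can otherwise deadlock a torus.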

Network Topology

• In the past, there was considerable research in network topology and in mapping algorithms to topology.
  – Key cost to be minimized: number of "hops" between nodes (e.g. "store and forward")
  – Modern networks hide hop cost (i.e., "wormhole routing"), so topology is no longer a major factor in algorithm performance.
• Example: on an IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.
• Need some background in network topology
  – Algorithms may have a communication topology
  – Topology affects bisection bandwidth.

Linear and Ring Topologies

• Linear array
  – Diameter = n-1; average distance ~ n/3.
  – Bisection bandwidth = 1 (in units of link bandwidth).
• Torus or Ring
  – Diameter = n/2; average distance ~ n/4.
  – Bisection bandwidth = 2.
  – Natural for algorithms that work with 1D arrays.

Meshes and Tori

• Two-dimensional mesh
  – Diameter = 2 * (sqrt(n) - 1)
  – Bisection bandwidth = sqrt(n)
• Two-dimensional torus
  – Diameter = sqrt(n)
  – Bisection bandwidth = 2 * sqrt(n)
• Generalizes to higher dimensions
  – Cray XT (e.g. Franklin @ NERSC) uses a 3D torus
• Natural for algorithms that work with 2D and/or 3D arrays (matmul)
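The diameter and bisection formulas from this slide and the previous one can be collected into one small helper for side-by-side comparison; a sketch, assuming n is a perfect square for the 2D cases (the function name and interface are made up for illustration):

    #include <math.h>
    #include <string.h>

    /* Diameter and bisection bandwidth (in units of link bandwidth) for the
     * topologies discussed so far, with n nodes. Assumes n is a perfect square
     * for the 2D mesh and torus. Returns 0 on success, -1 for an unknown name. */
    int topology_metrics(const char *topo, int n, double *diameter, double *bisection) {
        double s = sqrt((double)n);
        if      (strcmp(topo, "linear")  == 0) { *diameter = n - 1;       *bisection = 1;     }
        else if (strcmp(topo, "ring")    == 0) { *diameter = n / 2.0;     *bisection = 2;     }
        else if (strcmp(topo, "mesh2d")  == 0) { *diameter = 2 * (s - 1); *bisection = s;     }
        else if (strcmp(topo, "torus2d") == 0) { *diameter = s;           *bisection = 2 * s; }
        else return -1;
        return 0;
    }

For n = 1024 this gives a 2D torus a diameter of 32 and a bisection of 64 links, versus 1023 and 1 for a linear array.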

Hypercubes

• Number of nodes n = 2^d for dimension d.
  – Diameter = d.
  – Bisection bandwidth = n/2.
• [Figure: hypercubes of dimension 0d, 1d, 2d, 3d, 4d]
• Popular in early machines (Intel iPSC, nCUBE).
  – Lots of clever algorithms.
• Gray code addressing:
  – Each node is connected to the others whose addresses differ in exactly 1 bit.

[Figure: a 3D hypercube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111]
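Since each node has exactly d neighbors (one per address bit), enumerating them is just flipping bits; a minimal sketch:

    #include <stdio.h>

    /* Print the d neighbors of a hypercube node by flipping each address bit.
     * Example: node 5 (101) in a 3D cube has neighbors 100, 111, and 001. */
    void hypercube_neighbors(unsigned node, int d) {
        for (int bit = 0; bit < d; bit++)
            printf("neighbor of %u across dimension %d: %u\n",
                   node, bit, node ^ (1u << bit));
    }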

Trees

• Diameter = log n.
• Bisection bandwidth = 1.
• Easy layout as a planar graph.
• Many tree algorithms (e.g., summation).
• Fat trees avoid the bisection bandwidth problem:
  – More (or wider) links near the top.
  – Example: Thinking Machines CM-5.

Butterflies

• Diameter = log n.
• Bisection bandwidth = n.
• Cost: lots of wires.
• Used in the BBN Butterfly.
• Natural for FFT.

[Figure: a butterfly switch (ports labeled 0 and 1) and a multistage butterfly network built from such switches]

• Example: to get from proc 101 to 110, compare the addresses bit by bit and switch (cross) at a stage if the bits disagree, else go straight
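The per-stage decision in the example (cross if the corresponding address bits disagree, go straight if they agree) can be written as a one-line helper; a sketch (which stage examines which bit is a property of the particular network, so the bit index is just a parameter here):

    /* Butterfly routing rule: at the stage that examines address bit `bit`,
     * cross if source and destination disagree in that bit, else go straight.
     * For 101 -> 110, the two low bits disagree (cross) and the high bit
     * agrees (straight). Returns 1 for cross, 0 for straight. */
    int butterfly_cross(unsigned src, unsigned dst, int bit) {
        return ((src >> bit) & 1u) != ((dst >> bit) & 1u);
    }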

Topologies in Real Machines

(listed roughly newer to older)

• Cray XT3 and XT4: 3D Torus (approx)
• Blue Gene/L: 3D Torus
• SGI Altix: Fat tree
• Cray X1: 4D Hypercube*
• Myricom (Millennium): Arbitrary
• Quadrics (in HP Alpha server clusters): Fat tree
• IBM SP: Fat tree (approx)
• SGI Origin: Hypercube
• Intel Paragon (old): 2D Mesh
• BBN Butterfly (really old): Butterfly

Many of these are approximations: e.g., the X1 is really a "quad bristled hypercube" and some of the fat trees are not as fat as they should be at the top.

Performance Models

Latency and Bandwidth Model

• Time to send a message of length n is roughly

    Time = latency + n*cost_per_word = latency + n/bandwidth

• Topology is assumed irrelevant.
• Often called the "α-β model" and written

    Time = α + n*β

• Usually α >> β >> time per flop.
  – One long message is cheaper than many short ones:

    α + n*β << n*(α + 1*β)

  – Can do hundreds or thousands of flops for the cost of one message.
• Lesson: need a large computation-to-communication ratio to be efficient.

Alpha-Beta Parameters on Current Machines

• These numbers were obtained empirically

    machine         α (usec)   β (usec/Byte)
    T3E/Shm         1.2        0.003
    T3E/MPI         6.7        0.003
    IBM/LAPI        9.4        0.003
    IBM/MPI         7.6        0.004
    Quadrics/Get    3.267      0.00498
    Quadrics/Shm    1.3        0.005
    Quadrics/MPI    7.3        0.005
    Myrinet/GM      7.7        0.005
    Myrinet/MPI     7.2        0.006
    Dolphin/MPI     7.767      0.00529
    Giganet/VIPL    3.0        0.010
    GigE/VIPL       4.6        0.008
    GigE/MPI        5.854      0.00872

α is the latency in usecs; β is the inverse bandwidth in usecs per Byte

How well does the model Time = α + n*β predict actual performance?
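As a worked example of what these parameters imply, take the T3E/MPI row (α = 6.7 usec, β = 0.003 usec/Byte):
• An 8-byte message costs about 6.7 + 8*0.003 ≈ 6.7 usec, essentially pure latency.
• A single 1 MB message costs about 6.7 + 1,048,576*0.003 ≈ 3,150 usec, essentially pure bandwidth.
• Sending the same 1 MB as 1024 separate 1 KB messages costs about 1024*(6.7 + 1024*0.003) ≈ 10,000 usec, roughly 3x worse: one long message is cheaper than many short ones.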

Model Time Varying Message Size & Machines

Measured Message Time

LogP Model

• 4 performance parameters
  – L: latency experienced in each communication event
    • time to communicate a word or a small number of words
  – o: send/recv overhead experienced by the processor
    • time the processor is fully engaged in transmission or reception
  – g: gap between successive sends or recvs by a processor
    • 1/g = communication bandwidth
  – P: number of processor/memory modules

LogP Parameters: Overhead & Latency

• Non-overlapping overhead:

  [Timeline: P0 pays osend, the message spends L in the network, then P1 pays orecv]

  EEL = End-to-End Latency = osend + L + orecv

• Send and recv overhead can overlap:

  [Timeline: P0's osend overlaps with P1's orecv]

  EEL = f(osend, L, orecv) >= max(osend, L, orecv)

LogP Parameters: gap

• The gap is the delay between sending messages
• The gap could be greater than the send overhead
  – The NIC may be busy finishing the processing of the last message and cannot accept a new one.
  – Flow control or backpressure on the network may prevent the NIC from accepting the next message to send.
• No overlap => time to send n messages (pipelined) =

  [Timeline: P0 issues successive sends, each separated by one gap]

  (osend + L + orecv - gap) + n*gap = α + n*β
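A small sketch of this pipelined-send formula as a function (the numeric values in the comment are illustrative, not measurements from the slides):

    /* Time to push n messages through the pipeline under LogP with no overlap:
     * the first message pays the full end-to-end latency osend + L + orecv,
     * and each additional message adds one gap. This is the slide's identity
     * alpha + n*beta with alpha = osend + L + orecv - gap and beta = gap
     * (a per-message gap here, not a per-byte cost). */
    double logp_pipelined_time(int n, double osend, double L, double orecv, double gap) {
        return (osend + L + orecv - gap) + n * gap;
    }

    /* Example (illustrative numbers): osend = orecv = 2 usec, L = 5 usec,
     * gap = 3 usec  =>  100 messages take (2+5+2-3) + 100*3 = 306 usec. */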

Results: EEL and Overhead

[Stacked bar chart, in usec (0-25): send overhead (alone), send & recv overhead, recv overhead (alone), and added latency for T3E/MPI, T3E/Shmem, T3E/E-Reg, IBM/MPI, IBM/LAPI, Quadrics/MPI, Quadrics/Put, Quadrics/Get, M2K/MPI, M2K/GM, Dolphin/MPI, and Giganet/VIPL]

Data from Mike Welcome, NERSC

Send Overhead Over Time

• Overhead has not improved significantly; T3D was the best
  – Lack of integration; lack of attention in software

[Scatter plot: send overhead in usec (0-14) vs. year (roughly 1990 to 2002) for machines including NCube/2, CM5, Meiko, Paragon, T3D, T3E, Cenju4, SP3, SCI, Compaq, Dolphin, Myrinet, and Myrinet2K]

Data from Kathy Yelick, UCB and NERSC

Limitations of the LogP Model

• The LogP model has a fixed cost for each message
  – This is useful in showing how to quickly broadcast a single word
  – Other examples are also in the LogP papers
• For larger messages, there is a variation, LogGP
  – Two gap parameters, one for small and one for large messages
  – The large-message gap is the β in our previous model
• No topology considerations (including no limits on bisection bandwidth)
  – Assumes a fully connected network
  – OK for some algorithms with nearest-neighbor communication, but with "all-to-all" communication we need to refine this further
• This is a flat model, i.e., each processor is connected directly to the network
  – Clusters of multicores are not accurately modeled

Summary

• Latency and bandwidth are two important network metrics
  – Latency matters more than bandwidth for small messages
  – Bandwidth matters more than latency for large messages
  – Time = α + n*β
• Communication has overhead at both the sending and receiving ends
  – EEL = End-to-End Latency = osend + L + orecv
• Multiple communications can overlap

Historical Perspective

• Early distributed memory machines were:
  – Collections of microprocessors.
  – Communication was performed using bi-directional queues between nearest neighbors.
• Messages were forwarded by processors on the path.
  – "Store and forward" networking
• There was a strong emphasis on topology in algorithms, in order to minimize the number of hops = minimize time

Evolution of Distributed Memory Machines

• Special queue connections are being replaced by direct memory access (DMA):
  – Processor packs or copies messages.
  – Initiates transfer, goes on computing.
• Wormhole routing in hardware:
  – Special message processors do not interrupt main processors along the path.
  – Long message sends are pipelined.
  – Processors don't wait for the complete message before forwarding.
• Message passing libraries provide a store-and-forward abstraction:
  – Can send/receive between any pair of nodes, not just along one wire.
  – Time depends on distance since each processor along the path must participate.

Outline

• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives
