performance of pgas models on emerging multi-...

PerformanceofPGASModelsonEmergingMulti-/Many-coreArchitecturesusingMVAPICH2-X

J.Hashmi,H.SubramoniandDKPandaTheOhioStateUniversity

E-mail:{hashmi.29,Subramoni.1,panda.2}@osu.edu

Supercomputing’17OSUBoothTalk

Supercomputing‘17 2NetworkBasedComputingLaboratory

ParallelProgrammingModelsOverview

P1 P2 P3

SharedMemory

P1 P2 P3

Memory Memory Memory

P1 P2 P3

Memory Memory MemoryLogicalsharedmemory

SharedMemoryModel

SHMEM,DSMDistributedMemoryModel

MPI(MessagePassingInterface)

PartitionedGlobalAddressSpace(PGAS)

GlobalArrays,UPC,Chapel,X10,CAF,…

• Programmingmodelsprovideabstractmachinemodels

• Modelscanbemappedondifferenttypesofsystems– e.g.DistributedSharedMemory(DSM),MPIwithinanode,etc.

• PGASmodelsandHybridMPI+PGASmodelsaregraduallyreceivingimportance


PartitionedGlobalAddressSpace(PGAS)Models

• Keyfeatures- Simplesharedmemoryabstractions

- Lightweightone-sidedcommunication

- Easiertoexpressirregularcommunication

• DifferentapproachestoPGAS- Languages

• UnifiedParallelC(UPC)

• Co-ArrayFortran(CAF)

• X10

• Chapel

- Libraries• OpenSHMEM

• UPC++

• GlobalArrays


Hybrid(MPI+PGAS)Programming

• Applicationsub-kernelscanbere-writteninMPI/PGASbasedoncommunicationcharacteristics

• Benefits:

– BestofDistributedComputingModel

– BestofSharedMemoryComputingModel

• Cons

– Twodifferentruntimes

– Needgreatcarewhileprogramming

– Pronetodeadlockifnotcareful

Kernel1MPI

Kernel2MPI

Kernel3MPI

KernelNMPI

HPCApplication

Kernel2PGAS

KernelNPGAS


OverviewoftheMVAPICH2Project• HighPerformanceopen-sourceMPILibraryforInfiniBand,Omni-Path,Ethernet/iWARP,andRDMAoverConvergedEthernet(RoCE)

– MVAPICH(MPI-1),MVAPICH2(MPI-2.2andMPI-3.0),Startedin2001,Firstversionavailablein2002

– MVAPICH2-X(MPI+PGAS),Availablesince2011

– SupportforGPGPUs(MVAPICH2-GDR)andMIC(MVAPICH2-MIC),Availablesince2014

– SupportforVirtualization(MVAPICH2-Virt),Availablesince2015

– SupportforEnergy-Awareness(MVAPICH2-EA),Availablesince2015

– SupportforInfiniBandNetworkAnalysisandMonitoring(OSUINAM)since2015

– Usedbymorethan2,825organizationsin85countries

– Morethan432,000(>0.4million)downloadsfromtheOSUsitedirectly– EmpoweringmanyTOP500clusters(June‘17ranking)

• 1st,10,649,600-core(SunwayTaihuLight)atNationalSupercomputingCenterinWuxi,China

• 15th,241,108-core(Pleiades)atNASA

• 20th,462,462-core(Stampede)atTACC

• 44th,74,520-core(Tsubame2.5)atTokyoInstituteofTechnology

– AvailablewithsoftwarestacksofmanyvendorsandLinuxDistros(RedHatandSuSE)

– http://mvapich.cse.ohio-state.edu• EmpoweringTop500systemsforoveradecade

– System-XfromVirginiaTech(3rd inNov2003,2,200processors,12.25TFlops)->

– SunwayTaihuLight(1st inJun’17,10Mcores,100PFlops)


MVAPICH2SoftwareFamilyHigh-PerformanceParallelProgrammingLibraries

MVAPICH2 SupportforInfiniBand,Omni-Path,Ethernet/iWARP,andRoCE

MVAPICH2-X AdvancedMPIfeatures,OSUINAM,PGAS(OpenSHMEM,UPC,UPC++,andCAF),andMPI+PGASprogrammingmodelswithunifiedcommunicationruntime

MVAPICH2-GDR OptimizedMPIforclusterswithNVIDIAGPUs

MVAPICH2-Virt High-performanceandscalableMPIforhypervisorandcontainerbasedHPCcloud

MVAPICH2-EA EnergyawareandHigh-performanceMPI

MVAPICH2-MIC OptimizedMPIforclusterswithIntelKNC

Microbenchmarks

OMB MicrobenchmarkssuitetoevaluateMPIandPGAS(OpenSHMEM,UPC,andUPC++)librariesforCPUsandGPUs

Tools

OSUINAM Networkmonitoring,profiling,andanalysisforclusterswithMPIandschedulerintegration

OEMT UtilitytomeasuretheenergyconsumptionofMPIapplications


• PGASandHybridMPI+PGASmodelssupportinMVAPICH2-X• OptimizationsofPGASmodelsfordifferentarchitecturesin

MVAPICH2-X– PerformanceofPutandGetwithOpenSHMEM,UPC,andUPC++– ComparisononKNLandBroadwellforOpenSHMEMpoint-to-

point,collectives,andatomicsOperations– ImpactofAVX-512VectorizationandMCDRAMonOpenSHMEM

ApplicationKernels– PerformanceofUPC++ApplicationkernelsonMVAPICH2-X

communicationruntime

PerformanceofPGASModelsusingMVAPICH2-X


MVAPICH2-XforHybridMPI+PGASApplications

• CurrentModel– SeparateRuntimesforOpenSHMEM/UPC/UPC++/CAFandMPI– Possibledeadlockifbothruntimesarenotprogressed

– Consumesmorenetworkresource

• UnifiedcommunicationruntimeforMPI,UPC,UPC++,OpenSHMEM,CAF– Availablewithsince2012(startingwithMVAPICH2-X1.9)– http://mvapich.cse.ohio-state.edu


ApplicationLevelPerformancewithGraph500andSortGraph500ExecutionTime

J.Jose,S.Potluri,K.TomkoandD.K.Panda,DesigningScalableGraph500BenchmarkwithHybridMPI+OpenSHMEMProgrammingModels,InternationalSupercomputingConference(ISC’13),June2013

05101520253035

4K 8K 16K

Time(s)

No.ofProcesses

MPI-SimpleMPI-CSCMPI-CSRHybrid(MPI+OpenSHMEM)

13X

7.6X

• PerformanceofHybrid(MPI+OpenSHMEM)Graph500Design• 8,192processes

- 2.4X improvementoverMPI-CSR- 7.6X improvementoverMPI-Simple

• 16,384processes- 1.5X improvementoverMPI-CSR- 13X improvementoverMPI-Simple

SortExecutionTime

0

1000

2000

3000

500GB-512 1TB-1K 2TB-2K 4TB-4K

Time(secon

ds)

InputData- No.ofProcesses

MPI Hybrid

51%

• PerformanceofHybrid(MPI+OpenSHMEM)SortApplication

• 4,096processes,4TBInputSize- MPI– 2408 sec; 0.16TB/min- Hybrid – 1172 sec;0.36TB/min- 51% improvementoverMPI-design

J.Jose,S.Potluri,H.Subramoni,X.Lu,K.Hamidouche,K.Schulz,H.SundarandD.PandaDesigningScalableOut-of-coreSortingwithHybridMPI+PGASProgrammingModels,PGAS’14


AcceleratingMaTExk-NNwithHybridMPIandOpenSHMEM

KDD(2.5GB)on512cores

9.0%

KDD-tranc (30MB)on256cores

27.6%

• Benchmark:KDDCup2010(8,407,752records,2classes,k=5)• FortruncatedKDDworkloadon256cores,reduce27.6% executiontime• ForfullKDDworkloadon512cores,reduce9.0% executiontimeJ.Lin,K.Hamidouche,J.Zhang,X.Lu,A.Vishnu,D.Panda.Acceleratingk-NNAlgorithmwithHybridMPIandOpenSHMEM,OpenSHMEM2015

• MaTEx:MPI-basedMachinelearningalgorithmlibrary• k-NN:apopularsupervisedalgorithmforclassification• Hybriddesigns:

– OverlappedDataFlow;One-sidedDataTransfer;Circular-bufferStructure


IntelKnightsLanding(KNL)ProcessorArchitecture• HardwareMulti-threadedcores

– Upto72cores(model7290)

• Allcoresdividedinto36Tiles

• Eachtilecontainstwocore

– 2VPUpercore

– 1MBsharedL2cache

• 512-bitwidevectorregisters

– AVX-512extension

2 VPU

Core

2 VPU

CoreL2 Cache

(1MB)

CHA

AsingleTileofKNL


IntelKnightsLanding(KNL)Processor- MCDRAM

• On-packageMultiChannelDRAM(MCDRAM)

– 450GB/softheoreticalbandwidth(4xofDDR)

– ConfigurableinFlat,Cache,andHybridmodes

DDR MCDRAM DDR DDR MCD

RAM

MCDRAM

MCDRAM

Flatmode Cachemode Hybridmode


Motivation• OptimizingHPCprogrammingmodelsandruntimeson

emergingmulti-/many-coresisofgreatresearchinterest

• ExploringbenefitsofthearchitecturalfeaturesofmodernarchitecturesforPGASmodelsandapplications

• Characterizingandunderstanding

– theimpactofvectorizationonapplicationkernels

– MCDRAMvs.DDRperformance

– Exploitinghardwaremulti-threading


PerformanceofPGASModelsonKNLusingMVAPICH2-X

0.01

0.1

1

10

100

1000

1 4 16 64 256 1K 4K 16K 64K 256K 1M

Latency(us)

MessageSize

shmem_put

upc_putmem

upcxx_async_put

Intra-nodePUT

0.01

0.1

1

10

100

1000

1 4 16 64 256 1K 4K 16K 64K 256K 1M

Latency(us)

MessageSize

shmem_get

upc_getmem

upcxx_async_get

Intra-nodeGET

• Intra-nodeperformanceofone-sidedPut/GetoperationsofPGASlibraries/languagesusingMVAPICH2-Xcommunicationconduit

• Near-nativecommunicationperformanceisobservedonKNL


PerformanceofPGASModelsonKNLusingMVAPICH2-X

0.1

1

10

100

1000

10000

1 4 16 64 256 1K 4K 16K 64K 256K 1M

Latency(us)

MessageSize

shmem_put

upc_putmem

upcxx_async_put

Inter-nodePUT

0.1

1

10

100

1000

10000

1 4 16 64 256 1K 4K 16K 64K 256K 1M

Latency(us)

MessageSize

shmem_get

upc_getmem

upcxx_async_get

Inter-nodeGET

• Inter-nodeperformanceofone-sidedPut/GetoperationsusingMVAPICH2-XcommunicationconduitwithInfiniBandHCA(MT4115)

• NativeIBperformanceforallthreePGASmodelsisobserved.


MicrobenchmarkEvaluations(Intra-nodePut/Get)

0.01

0.1

1

10

100

1000

Time(us)

Messagesize(bytes)

Broadwell KNL

• Broadwellshowsabout3XbetterperformancethanKNLonlargemessage• Muti-threadedmemcpyroutinesonKNLcouldoffsetthedegradationcausedby

theslowercoreonbasicPut/Getoperations

Shmem_putmem Shmem_getmem

0.01

0.1

1

10

100

1000

Time(us)

Messagesize(bytes)

Broadwell KNL312.6

81.4

331.7

96.5

J.Hashmi,M.Li,H.Subramoni,D.Panda.ExploitingandEvaluatingOpenSHMEMonKNLArchitecture,OpenSHMEM2017.


MicrobenchmarkEvaluations(Inter-nodePut/Get)

1

10

100

Time(us)

Messagesize(bytes)

Broadwell KNL

• Inter-nodesmallmessagelatencyisonly2XworseonKNL.WhilelargemessageperformanceisalmostsimilaronbothKNLandBroadwell.

Shmem_putmem Shmem_getmem

1

10

100

1000

Time(us)

Messagesize(bytes)

Broadwell KNL

2.18

1.76

2.78

2.21


1

10

100

1000

10000

Time(us)

Messagesize(bytes)

Broadwell KNL

1

10

100

1000

10000

Time(us)

Messagesize(bytes)

Broadwell KNL

MicrobenchmarkEvaluations(Collectives)

• 2KNLnodes(64ppn)and8Broadwellnodes(16ppn).• 4XdegradationisobservedonKNLusingcollectivebenchmarks.• Basicpoint-to-pointperformancedifferenceisreflectedincollectivesaswell

Shmem_reduceon128processes Shmem_broadcaston128processes

~4X ~2X


MicrobenchmarkEvaluations(Atomics)

• UsingmultiplenodesofKNL,atomicoperationsshowedabout2.5Xdegradationoncompare-swap,andIncatomics

• Fetch-and-add(32-bit)showedupto4XdegradationonKNL

0

2

4

6

8

10

12

fadd32 fadd64 cswap32 cswap64 inc32 inc64

Time(us)

KNL Broadwell

OpenSHMEMatomicson128processes


NASParallelBenchmarkEvaluationNAS-EP(RNG),CLASS=B

• AVX-512vectorizedexecutionofBTkernelonKNLshowed30%improvementoverdefaultexecutionwhileEPkerneldidn’tshowanyimprovement

• Broadwellshowed20%improvementoveroptimizedKNLonBTand4XimprovementoverallKNLexecutionsonEPkernel(randomnumbergeneration).

010203040506070

16 36 64 100

Time(s)

No.ofprocesses

KNL(Default)KNL(AVX-512) KNL(AVX-512+MCDRAM)Broadwell

NAS-BT(PDEsolver),CLASS=B

0

4

8

12

16

20

16 32 64 128

Time(s)

No.ofprocesses



NASParallelBenchmarkEvaluation(contd.)NAS-MG(MultiGridsolver),CLASS=B

• SimilarperformancetrendsareobservedonBTandMGkernelsaswell• OnSPkernel,MCDRAMbasedexecutionshowedupto20%improvement

overdefaultat16processes.

0

10

20

30

40

50

16 36 64 100

Time(s)

No.ofprocesses


NAS-SP(non-linearPDE),CLASS=B

0

1

2

3

16 32 64 128

Time(s)

No.ofprocesses



ApplicationKernelsEvaluationHeatImageKernel

• OnheatdiffusionbasedkernelsAVX-512vectorizationshowedbetterperformance• MCDRAMshowedsignificantbenefitsonHeat-Imagekernelforallprocesscounts.

CombinedwithAVX-512vectorization,itshowedupto4Ximprovedperformance

1

10

100

1000

16 32 64 128

Time(s)

No.ofprocesses


Heat-2DKernelusingJacobimethod

0.1

1

10

100

16 32 64 128

Time(s)

No.ofprocesses



ApplicationKernelsEvaluation(contd.)DAXPYkernel

• Vectorizationhelpsinmatrixmultiplicationandvectoroperations.• Duetoheavilycomputeboundnatureofthesekernels,MCDRAMdidn’tshowany

significantperformanceimprovement.

0.1

10

1000

16 32 64 128

Time(s)

No.ofprocesses


MatrixMultiplicationkernel

1

10

100

16 32 64 128

Time(s)

No.ofprocesses



ApplicationKernelsEvaluation(contd.)

• Upto3Ximprovementonun-optimizedexecutionisobservedonKNL• Broadwellshowedupto2Xbetterperformanceforcore-by-corecomparison

0.01

0.1

1

10

100

2 4 8 16 32 64 128

Time(s)

No.ofprocesses

KNL(Default) KNL(AVX-512) KNL(AVX-512+MCDRAM) Broadwell

ScalableIntegerSortKernel(ISx)


Node-by-nodeEvaluationusingApplicationKernels

• AsinglenodeofKNLisevaluatedagainstasinglenodeofBroadwellusingalltheavailablephysicalcores

• HeatImageandISxkernels,showedbetterperformancethanIntelXeon

0.01

0.1

1

10

100

2DHeat HeatImg MatMul DAXPY ISx(strong)

Time(s)

KNL(68) Broadwell(28)

ApplicationKernelsonasingleKNLvs.singleBroadwellnode

+30%

-5%


• WeusedtwoapplicationkernelstoevaluateUPC++modelusingMVAPICH2-Xascommunicationruntime

• SparseMatrixVectorMultiplication(SpMV)

• AdaptiveMeshRefinement(AMR)kernel

– 2D-HeatconductionusingJacobiiterative

• Wedesigned2D-HeatkernelusingpureUPC++asynchronousprimitivesandprovideMVAPICH2-Xbasedcommunicationsupporttoachievenear-nativeperoformance.

• Weobservednearoptimalspeed-upforthesekernelsontwoKNLnodes

UPC++ApplicationKernelsPerformanceonKNL


ApplicationKernelsPerformanceofUPC++onMVAPICH2-X

0

0.2

0.4

0.6

0.8

1

2 4 8 16 32 64 128

Time(s)

No.ofProcesses

UPC++SpMV

Strong-scalingPerformanceofSpMV kernel(2Kx2K)

0

40

80

120

2 4 8 16 32 64 128

Time(s)

No.ofProcesses

UPC++2DHeat

Strong-scalingPerformanceof2D-Heatkernel(512x512)

• Implemented2DHeatapplicationkernelsinUPC++• SpMV and2DHeatkernelsusingMVAPICH2-Xshowedgoodscalabilityon

increasingnumberofprocessesofKNLJ.Hashmi,M.Li,H.Subramoni,D.Panda.PerformanceofPGASModelsonKNL:AComprehensiveStudywithMVAPICH2-X,IXPUG2017.


PerformanceResultsSummary

Put/Get and AtomicsPerformance

Core-by-core Application Performance

CollectivesPerformance

KNL (Default)

KNL (AVX512)

KNL (AVX512 + MCDRAM)

Broadwell

(Closer to center is better)

Node-by-node Application

Performance


Conclusion• ComprehensiveperformanceevaluationofMVAPICH2-XbasedOpenSHMEM,

UPC,andUPC++modelsovertheKNLarchitecture• Observedsignificantperformancegainsonapplicationkernelswhenusing

AVX-512vectorization– 2.5xperformancebenefitsintermsofexecutiontime

• MCDRAMbenefitsarenotprominentonmostoftheapplicationkernels– Lackofmemoryboundoperations

• KNLshowedupto3XworseperformancethanBroadwellforcore-by-coreevaluation

• KNLshowedbetteroron-parperformancethanBroadwellonHeat-ImageandISxkernelsforNode-by-Nodeevaluation

• TheruntimeimplementationsneedtotakeadvantageoftheconcurrencyofKNLcores


ThankYou!

Network-BasedComputingLaboratoryhttp://nowlab.cse.ohio-state.edu/

[email protected]

TheHigh-PerformanceMPI/PGASProjecthttp://mvapich.cse.ohio-state.edu/

TheHigh-PerformanceDeepLearningProjecthttp://hidl.cse.ohio-state.edu/

TheHigh-PerformanceBigDataProjecthttp://hibd.cse.ohio-state.edu/

performance of pgas models on emerging multi-...

Documents