Performance of PGAS Models on Emerging Multi-/Many-core Architectures using MVAPICH2-X

J. Hashmi, H. Subramoni and DK Panda
The Ohio State University
E-mail: {hashmi.29, Subramoni.1, panda.2}@osu.edu
Supercomputing '17 OSU Booth Talk

Supercomputing '17, Network Based Computing Laboratory

Parallel Programming Models Overview

[Figure: three abstract machine models, each drawn with processes P1, P2, P3]

Shared Memory Model (one shared memory): SHMEM, DSM
Distributed Memory Model (one memory per process): MPI (Message Passing Interface)
Partitioned Global Address Space (PGAS) (one memory per process, forming a logical shared memory): Global Arrays, UPC, Chapel, X10, CAF, …

• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and Hybrid MPI+PGAS models are gradually gaining importance

Partitioned Global Address Space (PGAS) Models

• Key features
  - Simple shared memory abstractions
  - Lightweight one-sided communication
  - Easier to express irregular communication
• Different approaches to PGAS
  - Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  - Libraries
    • OpenSHMEM
    • UPC++
    • Global Arrays
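To make the "partitioned global address space" idea concrete, here is a toy, single-process sketch in Python. It is not OpenSHMEM, UPC, or any real PGAS runtime: the class `ToyGlobalArray`, the block distribution, and the `locate`/`put`/`get` names are all assumptions made for illustration. It only shows how a global index can be translated into an owning partition plus a local offset, so that any caller can read or write any element without the owner taking part.

```python
class ToyGlobalArray:
    """Toy model of a block-distributed global array (illustrative only)."""

    def __init__(self, n_elems, n_pes):
        self.n_elems = n_elems
        self.n_pes = n_pes
        # Ceiling division: every PE owns at most `block` elements.
        self.block = -(-n_elems // n_pes)
        # One private list per PE stands in for that PE's local memory.
        self.partitions = [
            [0] * max(0, min(self.block, n_elems - pe * self.block))
            for pe in range(n_pes)
        ]

    def locate(self, gidx):
        """Map a global index to (owning PE, local offset)."""
        if not 0 <= gidx < self.n_elems:
            raise IndexError(gidx)
        return divmod(gidx, self.block)

    def put(self, gidx, value):
        """One-sided write: only the caller acts, the owner is passive."""
        pe, off = self.locate(gidx)
        self.partitions[pe][off] = value

    def get(self, gidx):
        """One-sided read of any element in the global index space."""
        pe, off = self.locate(gidx)
        return self.partitions[pe][off]


ga = ToyGlobalArray(n_elems=10, n_pes=4)
ga.put(7, 42)  # lands in partition 2 at local offset 1
```

In a real PGAS library the partitions live in different processes and the put/get would travel over shared memory or the network; the address translation step is the part this sketch keeps.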

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of Distributed Computing Model
  – Best of Shared Memory Computing Model
• Cons
  – Two different runtimes
  – Need great care while programming
  – Prone to deadlock if not careful

[Figure: an HPC Application made of Kernel 1 (MPI), Kernel 2 (MPI), Kernel 3 (MPI), …, Kernel N (MPI), with Kernel 2 and Kernel N also shown in PGAS]

Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Started in 2001, First version available in 2002
  – MVAPICH2-X (MPI + PGAS), Available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
  – Support for Virtualization (MVAPICH2-Virt), Available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 432,000 (> 0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (June '17 ranking)
    • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
    • 15th, 241,108-core (Pleiades) at NASA
    • 20th, 462,462-core (Stampede) at TACC
    • 44th, 74,520-core (Tsubame 2.5) at Tokyo Institute of Technology
  – Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)

MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
  MVAPICH2       Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
  MVAPICH2-X     Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
  MVAPICH2-GDR   Optimized MPI for clusters with NVIDIA GPUs
  MVAPICH2-Virt  High-performance and scalable MPI for hypervisor and container based HPC cloud
  MVAPICH2-EA    Energy aware and High-performance MPI
  MVAPICH2-MIC   Optimized MPI for clusters with Intel KNC

Microbenchmarks
  OMB            Microbenchmarks suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
  OSU INAM       Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
  OEMT           Utility to measure the energy consumption of MPI applications

Performance of PGAS Models using MVAPICH2-X

• PGAS and Hybrid MPI+PGAS models support in MVAPICH2-X
• Optimizations of PGAS models for different architectures in MVAPICH2-X
  – Performance of Put and Get with OpenSHMEM, UPC, and UPC++
  – Comparison on KNL and Broadwell for OpenSHMEM point-to-point, collectives, and atomics Operations
  – Impact of AVX-512 Vectorization and MCDRAM on OpenSHMEM Application Kernels
  – Performance of UPC++ Application kernels on MVAPICH2-X communication runtime

MVAPICH2-X for Hybrid MPI + PGAS Applications

• Current Model
  – Separate Runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
  – Possible deadlock if both runtimes are not progressed
  – Consumes more network resources
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF
  – Available since 2012 (starting with MVAPICH2-X 1.9)
  – http://mvapich.cse.ohio-state.edu

Application Level Performance with Graph500 and Sort

Graph500 Execution Time

[Figure: Time (s) vs. No. of Processes (4K, 8K, 16K) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); annotated 7.6X and 13X]

• Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design
• 8,192 processes
  - 2.4X improvement over MPI-CSR
  - 7.6X improvement over MPI-Simple
• 16,384 processes
  - 1.5X improvement over MPI-CSR
  - 13X improvement over MPI-Simple

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC'13), June 2013

Sort Execution Time

[Figure: Time (seconds) vs. Input Data - No. of Processes (500GB-512, 1TB-1K, 2TB-2K, 4TB-4K) for MPI and Hybrid; annotated 51%]

• Performance of Hybrid (MPI+OpenSHMEM) Sort Application
• 4,096 processes, 4 TB Input Size
  - MPI: 2408 sec; 0.16 TB/min
  - Hybrid: 1172 sec; 0.36 TB/min
  - 51% improvement over MPI-design

J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS'14
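The 51% figure for the Sort application can be reproduced from the two reported execution times. The short sketch below uses only those two numbers (2408 s for MPI, 1172 s for Hybrid); the helper names `time_reduction` and `speedup` are made up for this example and are not part of any benchmark code.

```python
def time_reduction(baseline_s, improved_s):
    """Fraction of the baseline execution time that was eliminated."""
    return (baseline_s - improved_s) / baseline_s


def speedup(baseline_s, improved_s):
    """How many times faster the improved run is than the baseline."""
    return baseline_s / improved_s


mpi_s, hybrid_s = 2408.0, 1172.0  # times reported on the slide
reduction_pct = 100.0 * time_reduction(mpi_s, hybrid_s)
sort_speedup = speedup(mpi_s, hybrid_s)

print(f"reduction in execution time: {reduction_pct:.1f}%")
print(f"speedup: {sort_speedup:.2f}x")
```

The reduction comes out at roughly 51%, which corresponds to a speedup of about 2X; the two are different ways of stating the same result.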

Accelerating MaTEx k-NN with Hybrid MPI and OpenSHMEM

• MaTEx: MPI-based Machine learning algorithm library
• k-NN: a popular supervised algorithm for classification
• Hybrid designs:
  – Overlapped Data Flow; One-sided Data Transfer; Circular-buffer Structure

[Figure: KDD-tranc (30MB) on 256 cores, annotated 27.6%; KDD (2.5GB) on 512 cores, annotated 9.0%]

• Benchmark: KDD Cup 2010 (8,407,752 records, 2 classes, k=5)
• For truncated KDD workload on 256 cores, reduce 27.6% execution time
• For full KDD workload on 512 cores, reduce 9.0% execution time

J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, D. Panda. Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM, OpenSHMEM 2015

Intel Knights Landing (KNL) Processor Architecture

• Hardware Multi-threaded cores
  – Up to 72 cores (model 7290)
• All cores divided into 36 Tiles
• Each tile contains two cores
  – 2 VPU per core
  – 1 MB shared L2 cache
• 512-bit wide vector registers
  – AVX-512 extension

[Figure: A single Tile of KNL: two cores, each with 2 VPUs, sharing a 1 MB L2 cache and a CHA]

Intel Knights Landing (KNL) Processor - MCDRAM

• On-package Multi-Channel DRAM (MCDRAM)
  – 450 GB/s of theoretical bandwidth (4x of DDR)
  – Configurable in Flat, Cache, and Hybrid modes

[Figure: arrangement of DDR and MCDRAM in Flat mode, Cache mode, and Hybrid mode]

Motivation

• Optimizing HPC programming models and runtimes on emerging multi-/many-cores is of great research interest
• Exploring benefits of the architectural features of modern architectures for PGAS models and applications
• Characterizing and understanding
  – the impact of vectorization on application kernels
  – MCDRAM vs. DDR performance
  – Exploiting hardware multi-threading

Performance of PGAS Models on KNL using MVAPICH2-X

[Figure: Intra-node PUT: Latency (us) vs. Message Size (1 to 1M) for shmem_put, upc_putmem, and upcxx_async_put]
[Figure: Intra-node GET: Latency (us) vs. Message Size (1 to 1M) for shmem_get, upc_getmem, and upcxx_async_get]

• Intra-node performance of one-sided Put/Get operations of PGAS libraries/languages using MVAPICH2-X communication conduit
• Near-native communication performance is observed on KNL

Performance of PGAS Models on KNL using MVAPICH2-X

[Figure: Inter-node PUT: Latency (us) vs. Message Size (1 to 1M) for shmem_put, upc_putmem, and upcxx_async_put]
[Figure: Inter-node GET: Latency (us) vs. Message Size (1 to 1M) for shmem_get, upc_getmem, and upcxx_async_get]

• Inter-node performance of one-sided Put/Get operations using MVAPICH2-X communication conduit with InfiniBand HCA (MT4115)
• Native IB performance for all three PGAS models is observed.

Microbenchmark Evaluations (Intra-node Put/Get)

[Figure: Shmem_putmem and Shmem_getmem panels: Time (us) vs. Message size (bytes) for Broadwell and KNL; annotated values 312.6, 81.4, 331.7, 96.5]

• Broadwell shows about 3X better performance than KNL on large messages
• Multi-threaded memcpy routines on KNL could offset the degradation caused by the slower core on basic Put/Get operations

J. Hashmi, M. Li, H. Subramoni, D. Panda. Exploiting and Evaluating OpenSHMEM on KNL Architecture, OpenSHMEM 2017.

Microbenchmark Evaluations (Inter-node Put/Get)

[Figure: Shmem_putmem and Shmem_getmem panels: Time (us) vs. Message size (bytes) for Broadwell and KNL; annotated values 2.18, 1.76, 2.78, 2.21]

• Inter-node small message latency is only 2X worse on KNL, while large message performance is almost similar on both KNL and Broadwell.

Microbenchmark Evaluations (Collectives)

[Figure: Shmem_reduce on 128 processes (annotated ~4X) and Shmem_broadcast on 128 processes (annotated ~2X): Time (us) vs. Message size (bytes) for Broadwell and KNL]

• 2 KNL nodes (64 ppn) and 8 Broadwell nodes (16 ppn).
• 4X degradation is observed on KNL using collective benchmarks.
• Basic point-to-point performance difference is reflected in collectives as well

Microbenchmark Evaluations (Atomics)

• Using multiple nodes of KNL, atomic operations showed about 2.5X degradation on compare-swap, and Inc atomics
• Fetch-and-add (32-bit) showed up to 4X degradation on KNL

[Figure: OpenSHMEM atomics on 128 processes: Time (us) for fadd32, fadd64, cswap32, cswap64, inc32, inc64 on KNL and Broadwell]
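For readers unfamiliar with the three atomics measured above, the following toy Python sketch shows their semantics on a single shared cell. It is not the OpenSHMEM API and says nothing about how MVAPICH2-X implements these operations: the class `ToyAtomicCell` and its method names are hypothetical, and a `threading.Lock` stands in for the hardware or network atomicity a real runtime would rely on.

```python
import threading


class ToyAtomicCell:
    """One shared integer with fetch-add, compare-swap, and increment."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # stand-in for real atomicity

    def fetch_add(self, delta):
        """Add delta and return the value held before the addition."""
        with self._lock:
            old = self._value
            self._value = old + delta
            return old

    def compare_swap(self, expected, desired):
        """Store desired only if the cell equals expected; return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired
            return old

    def inc(self):
        """Increment by one without returning anything."""
        self.fetch_add(1)

    def read(self):
        with self._lock:
            return self._value


cell = ToyAtomicCell(0)


def worker():
    # Each worker performs 1000 increments on the shared cell.
    for _ in range(1000):
        cell.inc()


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every update happens under the lock, no increment is lost even with four concurrent workers. Fetch-add and compare-swap return the old value to the caller, which in a distributed setting implies a round trip, while increment does not need a reply.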

NAS Parallel Benchmark Evaluation

[Figure: two panels, NAS-EP (RNG), CLASS=B and NAS-BT (PDE solver), CLASS=B: Time (s) vs. No. of processes for KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell]

• AVX-512 vectorized execution of BT kernel on KNL showed 30% improvement over default execution while EP kernel didn't show any improvement
• Broadwell showed 20% improvement over optimized KNL on BT and 4X improvement over all KNL executions on EP kernel (random number generation).

NAS Parallel Benchmark Evaluation (contd.)

[Figure: two panels, NAS-MG (MultiGrid solver), CLASS=B and NAS-SP (non-linear PDE), CLASS=B: Time (s) vs. No. of processes for KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell]

• Similar performance trends are observed on BT and MG kernels as well
• On SP kernel, MCDRAM based execution showed up to 20% improvement over default at 16 processes.

Application Kernels Evaluation

[Figure: two panels, Heat Image Kernel and Heat-2D Kernel using Jacobi method: Time (s) vs. No. of processes (16, 32, 64, 128) for KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell]

• On heat diffusion based kernels AVX-512 vectorization showed better performance
• MCDRAM showed significant benefits on Heat-Image kernel for all process counts. Combined with AVX-512 vectorization, it showed up to 4X improved performance
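The Heat-2D kernel is described as using the Jacobi method. As a reminder of what that computation looks like, here is a minimal serial sketch in plain Python. It is not the benchmarked kernel: the grid size, boundary values, tolerance, and the function names `jacobi_step` and `solve` are all assumptions chosen for illustration, and a parallel version would additionally exchange boundary rows between processes.

```python
def jacobi_step(grid):
    """Return a new grid after one Jacobi sweep; boundary cells stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Each interior cell becomes the average of its four neighbours,
            # always read from the previous iteration's grid.
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new


def solve(grid, tol=1e-6, max_iters=10_000):
    """Iterate until the largest per-cell change drops below tol."""
    for iteration in range(1, max_iters + 1):
        new = jacobi_step(grid)
        diff = max(
            abs(new[i][j] - grid[i][j])
            for i in range(len(grid))
            for j in range(len(grid[0]))
        )
        grid = new
        if diff < tol:
            return grid, iteration
    return grid, max_iters


n = 8  # hypothetical grid size
grid = [[0.0] * n for _ in range(n)]
grid[0] = [100.0] * n  # hot top edge, remaining edges held at 0
result, iters = solve(grid)
```

Every interior update reads only the old grid, so the inner loop has no dependence between iterations. That independence is what makes stencil kernels of this kind natural candidates for vectorization.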

Application Kernels Evaluation (contd.)

[Figure: two panels, DAXPY kernel and Matrix Multiplication kernel: Time (s) vs. No. of processes (16, 32, 64, 128) for KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell]

• Vectorization helps in matrix multiplication and vector operations.
• Due to heavily compute bound nature of these kernels, MCDRAM didn't show any significant performance improvement.
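DAXPY is the standard BLAS level-1 operation y = a*x + y. A minimal serial sketch in plain Python is shown below for reference; it is not the benchmarked kernel, and the function name and sample vectors are chosen only for this example.

```python
def daxpy(a, x, y):
    """Update y in place with y[i] = a * x[i] + y[i] and return it."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        # Each element is updated independently of all the others.
        y[i] = a * x[i] + y[i]
    return y


x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
daxpy(2.0, x, y)  # y is now [12.0, 24.0, 36.0]
```

The loop body touches one element of each vector and carries no dependence across iterations, so a compiled version of this loop maps directly onto wide vector instructions.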

Application Kernels Evaluation (contd.)

[Figure: Scalable Integer Sort Kernel (ISx): Time (s) vs. No. of processes (2 to 128) for KNL (Default), KNL (AVX-512), KNL (AVX-512+MCDRAM), and Broadwell]

• Up to 3X improvement over un-optimized execution is observed on KNL
• Broadwell showed up to 2X better performance for core-by-core comparison

Node-by-node Evaluation using Application Kernels

• A single node of KNL is evaluated against a single node of Broadwell using all the available physical cores
• Heat Image and ISx kernels showed better performance on KNL than on Intel Xeon

[Figure: Application Kernels on a single KNL vs. single Broadwell node: Time (s) for 2D Heat, Heat Img, MatMul, DAXPY, and ISx (strong) on KNL (68) and Broadwell (28); annotated +30% and -5%]

UPC++ Application Kernels Performance on KNL

• We used two application kernels to evaluate UPC++ model using MVAPICH2-X as communication runtime
  • Sparse Matrix Vector Multiplication (SpMV)
  • Adaptive Mesh Refinement (AMR) kernel
    – 2D-Heat conduction using Jacobi iteration
• We designed 2D-Heat kernel using pure UPC++ asynchronous primitives and provide MVAPICH2-X based communication support to achieve near-native performance.
• We observed near optimal speed-up for these kernels on two KNL nodes
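As a reference for the SpMV kernel named above, the following is a minimal serial sketch of sparse matrix-vector multiplication in plain Python. The slides do not say which storage format the UPC++ kernel uses; compressed sparse row (CSR) is assumed here purely for illustration, and the function name and the small 3x3 matrix are made up for the example.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Non-zeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Dense form of the example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]

y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])  # [7.0, 6.0, 17.0]
```

The indirect read `x[col_idx[k]]` is the irregular access pattern that makes SpMV a common test for one-sided communication: when `x` is distributed, a row may need elements owned by other processes.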

Application Kernels Performance of UPC++ on MVAPICH2-X

[Figure: Strong-scaling Performance of SpMV kernel (2Kx2K): Time (s) vs. No. of Processes (2 to 128) for UPC++ SpMV]
[Figure: Strong-scaling Performance of 2D-Heat kernel (512x512): Time (s) vs. No. of Processes (2 to 128) for UPC++ 2D Heat]

• Implemented 2D Heat application kernels in UPC++
• SpMV and 2D Heat kernels using MVAPICH2-X showed good scalability on increasing number of processes of KNL

J. Hashmi, M. Li, H. Subramoni, D. Panda. Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X, IXPUG 2017.
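Strong scaling keeps the problem size fixed and increases the process count, so "good scalability" is usually quantified as speedup and parallel efficiency relative to the smallest run. The sketch below shows that calculation. The timings in `sample` are hypothetical placeholders, not values read from the plots, and `strong_scaling` is a helper name invented for this example.

```python
def strong_scaling(times_by_procs):
    """Return {procs: (speedup, efficiency)} relative to the smallest run."""
    base_p = min(times_by_procs)
    base_t = times_by_procs[base_p]
    table = {}
    for procs, t in sorted(times_by_procs.items()):
        s = base_t / t                   # how much faster than the baseline
        ideal = procs / base_p           # speedup if scaling were perfect
        table[procs] = (s, s / ideal)    # efficiency of 1.0 means ideal scaling
    return table


# Hypothetical timings in seconds, NOT taken from the slides.
sample = {2: 100.0, 4: 52.0, 8: 28.0}
table = strong_scaling(sample)

for procs, (s, eff) in table.items():
    print(f"{procs:>3} procs: speedup {s:.2f}x, efficiency {eff:.0%}")
```

Applied to measured times for the 2 to 128 process runs, an efficiency that stays close to 1.0 would correspond to the near-optimal speed-up the slides describe.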

Performance Results Summary

[Figure: chart comparing KNL (Default), KNL (AVX512), KNL (AVX512 + MCDRAM), and Broadwell along four axes: Put/Get and Atomics Performance, Core-by-core Application Performance, Collectives Performance, and Node-by-node Application Performance (closer to center is better)]

Conclusion

• Comprehensive performance evaluation of MVAPICH2-X based OpenSHMEM, UPC, and UPC++ models over the KNL architecture
• Observed significant performance gains on application kernels when using AVX-512 vectorization
  – 2.5x performance benefits in terms of execution time
• MCDRAM benefits are not prominent on most of the application kernels
  – Lack of memory bound operations
• KNL showed up to 3X worse performance than Broadwell for core-by-core evaluation
• KNL showed better or on-par performance than Broadwell on Heat-Image and ISx kernels for Node-by-Node evaluation
• The runtime implementations need to take advantage of the concurrency of KNL cores

Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

[email protected]

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/