computational scientist kaust supercomputing lab · 2018-01-31 · performance •how fast should...

Introduction to Numerical Libraries for HPC

[email protected]

ComputationalScientistKAUSTSupercomputingLab

Bilel Hadri 1

NumericalLibraries– ApplicationAreas

• Mostusedlibraries/softwareinHPC!• LinearAlgebra

– Systemsofequations• Direct,Iterative,Multigrid solvers• Sparse,Densesystem

– Eigenvalueproblems– Leastsquares

• Signalprocessing– FFT

• NumericalIntegration• RandomNumberGenerators

Bilel Hadri – Introduction to Numerical Libraries for HPC 2

NumericalLibraries- Motivations

• Don’tReinventtheWheel!• Improvesproductivity!• Getabetterperformance!

– Fasterandbetteralgorithms


Faster (Better Code)• Achieving best performance requires creating very

processor- and system-specific code• Example: Dense matrix-matrix multiply

– Simple to express: do i=1, n

do j=1,ndo k=1,n

c(i,j) = c(i,j) + a(i,k) * b(k,j)

enddo

enddo

enddo


Performance• How fast should this run?

– Our matrix-matrix multiply algorithm has 2n3 floating point operations• 3 nested loops, each with n iterations• 1 multiply, 1 add in each inner iteration

– For n=100, 2x106 operations, about 1 msec on a 2GHz processor– For n=1000, 2x109 operations, or about 1 sec

• Reality:– N=100 1.1ms– N=1000 6sà Obvious expression of algorithms are not transformed into leading performance.


NumericalLibraries– Packages• LinearAlgebra

– BLAS/LAPACK/ScaLAPACK /PLASMA/MAGMA/HiCMA– PETSc– HYPRE– TRILINOS

• Signalprocessing– FFTW

• NumericalIntegration– GSL

• RandomNumberGenerators– SPRNG


Others


MUMPS4.9.2.MUMPS(MUltifrontal MassivelyParallelsparsedirectSolver)isapackageofparallel,sparse,directlinearsystemsolversbasedonamultifrontalalgorithm.http://graal.enslyon.fr/MUMPS/

SuperLU 4.3.SuperLU isasequentialversionofSuperLU_dist andasequentialincompleteLUpreconditioner thatcanacceleratetheconvergenceofKrylovsubspaceiterativesolvers.http://crd.lbl.gov/~xiaoye/SuperLU/

ParMETIS 4.0.2.ParMETIS (ParallelGraphPartitioningandFillreducingMatrixOrdering)isalibraryofroutinesthatpartitionunstructuredgraphsandmeshesandcomputefillreducing orderingsofsparsematrices.http://glaros.dtc.umn.edu/gkhome/views/metis/


• SUNDIALS 2.5.0 (SUite of Nonlinear and DIfferential/Algebraic equationSolvers) consists of 5 solvers: CVODE, CVODES, IDA, IDAS, and KINSOL. Inaddition, SUNDIALS provides a MATLAB interface to CVODES, IDAS, andKINSOL that is called sundialsTB.

https://computation.llnl.gov/casc/sundials/main.html

• Scotch 5.1.12b Scotch is a software package and libraries for sequentialand parallel graph partitioning, static mapping, sparse matrix blockordering, and sequential mesh and hypergraph partitioning.

http://www.labri.fr/perso/pelegrin/scotch

Note: On Shaheen , they are all grouped into cray-tpsl

• FreelyAvailableSoftwareforLinearAlgebrahttp://www.netlib.org/utk/people/JackDongarra/la-sw.html

BLAS(BasicLinearAlgebraSubprograms)


The BLAS functionality is divided into three levels: • Level 1: contains vector operations of the form:

• Level 2: contains matrix-vector operations of the form:

• Level 3: contains matrix-matrix operations of the form:

• Several implementations for different languages exist– Reference implementation http://www.netlib.org/blas/

BLAS:namingconventions• Eachroutinehasanamewhichspecifiestheoperation,thetypeofmatricesinvolved

andtheirprecisions.Namesareintheform:PMMOO”.


– Each operation is defined for four precisions (P) • S single real

D double real C single complex Z double complex

– The types of matrices are (MM) • GE general

GB general band SY symmetric SB symmetric band SP symmetric packed HE hermitianHB hermitian band HP hermitian packed TR triangular TB triangular band TP triangular packed

– Some of the most common operations (OO):• DOT scalar product, x^T y

AXPY vector sum, α x + y MV matrix-vector product, A x SV matrix-vector solve, inv(A) x MM matrix-matrix product, A B SM matrix-matrix solve, inv(A) B

• ExamplesSGEMM stands for “single-precision general matrix-matrix multiply”DGEMM stands for “double-precision matrix-matrix multiply”.

BLASLevel1routines• Vectoroperations

(xROT,xSWAP,xCOPYetc.)• Scalar dotproducts

(xDOTetc.)• Vectornorms

(IxAMXetc.)


BLASLevel2routines

• Matrix-vectoroperations(xGEMV,xGBMV,xHEMV,xHBMV etc.)

• SolvingTx =y forx,whereTistriangular(xGER,xHER etc.)


BLASLevel3routines• Matrix-matrixoperations(xGEMM etc.)• Solvingfortriangularmatrices(xTRMM)• Widelyusedmatrix-matrixmultiply(xSYMM,xGEMM)


LAPACK(LinearAlgebraPACKage)• LinearAlgebraPACKage

– http://www.netlib.org/lapack/– Providesroutinesfor

• Solvingsystemsofsimultaneouslinearequations,• Least-squaressolutionsoflinearsystemsofequations,• Eigenvalue problems,• HouseholdertransformationtoimplementQRdecompositiononamatrixand• Singularvalueproblems

– Wasinitiallydesignedtorunefficientlyonsharedmemoryvector machines– DependsonBLAS– HasbeenextendedfordistributedsystemsScaLAPACK(ScalableLinear

AlgebraPACKage)


LAPACKnamingconventions• VerysimilartoBLAS

– XYYZZZ• X:datatype

– S:REAL– D:DOUBLEPRECISION– C:COMPLEX– Z:COMPLEX*16orDOUBLECOMPLEX

• YY:matrixtype– BD:bidiagonal– DI:diagonal– GB:generalband– GE:general(i.e.,unsymmetric,insomecases

rectangular)– GG:generalmatrices,generalizedproblem

(i.e.,apairofgeneralmatrices)– GT:generaltridiagonal– HB:(complex)Hermitian band– HE:(complex)Hermitian– HG:upperHessenberg matrix,generalized

problem(i.e aHessenberg anda triangularmatrix)

– HP:(complex)Hermitian,packedstorage– HS:upperHessenberg– OP:(real)orthogonal,packedstorage– OR:(real)orthogonal– PB:symmetricorHermitian positivedefinite

band

• YY:morematrixtypes– PO:symmetricorHermitian positivedefinite– PP:symmetricorHermitian positivedefinite,

packedstorage– PT:symmetricorHermitian positivedefinite

tridiagonal– SB:(real)symmetricband– SP:symmetric,packedstorage– ST:(real)symmetrictridiagonal– SY:symmetric– TB:triangularband– TG:triangularmatrices,generalizedproblem

(i.e.,apairoftriangularmatrices)– TP:triangular,packedstorage– TR:triangular(orinsomecasesquasi-

triangular)– TZ:trapezoidal– UN:(complex)unitary– UP:(complex)unitary,packedstorage

• ZZZ:performedcomputation– Linearsystems– Factorizations– Eigenvalueproblems– Singularvaluedecomposition– Etc.


LAPACKroutines


http://www.icl.utk.edu/~mgates3/docs/lapack-summary.pdf

NumericalLibrariespackages

• VendorlibrariesoptimizedimplementationsofBLAS,LAPACK,ScaLAPACK totheirprocessorsandtheirplatform.


Dongarra/ICL

LAPACK&ScaLAPACK

• ScaLAPACK isalibrarywithasubsetofLAPACKroutinesrunningondistributedmemorymachines.

• ScaLAPACK isdesignedforheterogeneouscomputingandispotableonanycomputerthatsupportsMPIorPVM.

• http://www.netlib.org/scalapack/scalapack_home.html


OverviewofScaLAPACK



WhyuseLAPACKorScaLAPACK?

• Solvingsystemsof:– Linearequations:– Leastsquares:– Eigenvalue problem:– Singularvalueproblem:

2||||min bAx-

bAx =

TVUA S=

𝐴𝑥 = 𝜆𝑥

ReferenceBLASvs Tuned• ThereferenceBLASandLAPACKlibrariesarereferenceimplementationsoftheBLASand

LAPACKstandard.Thesearenotoptimisedandnotmulti-threaded,sonotmuchperformanceshouldbeexpected.Theselibrariesareavailablefordownloadhttp://www.netlib.org/blas andhttp://www.netlib.org/lapack

• TheAutomaticallyTunedLinearAlgebraSoftware,ATLAS.Duringcompiletime,ATLASautomaticallychosesthealgorithmsdeliveringthebestperformance.ATLASdoesnotcontainallLAPACKfunctionality;itcanbedownloadedfromhttp://www.netlib.org/atlas

• TheGotoBLASanimplementationofthelevel3BLASaimedathighefficiency].TheGotoBLASisavailablefordownloadfromhttp://www.tacc.utexas.edu/resources/software


OptimizedvendorlibrariesforBLAS/LAPACK

• Highlyefficientversions• Handtunedassemblybyhardwarevendors• Providenearpeakperformance• Severalvendorsprovidelibrariesoptimizedfortheirarchitecture(AMD,Fujitsu,IBM,Intel,NEC,…)– IntelàMKL– Crayà LibSci– AMDà ACML– IBMà ESSL

• USEthem!(Speedupupto10ormore)Bilel Hadri – Introduction to Numerical Libraries for HPC 22

AMD/MKL• ACML(AMDCoreMathLibrary)

– LAPACK,BLAS,andextendedBLAS(sparse),FFTs(single- anddouble-precision,realandcomplexdatatypes).

– APIsforbothFortranandC– https://developer.amd.com/amd-open64-software-development-

kit/building-with-acml/

• MKL(MathKernelLibrary)– LAPACK,BLAS,andextendedBLAS(sparse),FFTs(single- anddouble-precision,

realandcomplexdatatypes).– APIsforbothFortranandC– www.intel.com/software/products/mkl/– UsetheMKLadvisorypagetolinkyourcodewithit:

https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

Bilel Hadri – Introduction to Numerical Libraries for HPC

ExamplewithSGEMM

Bilel Hadri – Introduction to Numerical Libraries for HPC 24http://homepage.ntu.edu.tw/~wttsai/fortran/

Fortranexample


Sourceavailableathttp://homepage.ntu.edu.tw/~wttsai/fortran/

LinkingexamplesLibrary Compiler Link-flags

LIBSCIonCray

Cray Environmentbydefault:compilewithoutaddingflagsGNU compilewithoutaddingflagsIntel compilewithoutaddingflags

ACMLGNU /opt/acml/4.4.0/gfortran64_mp/lib/libacml_mp.a –fopenmpIntel /opt/acml/4.4.0/ifort64_mp/lib/libacml_mp.a-openmp–lpthread

MKL

PGI-Wl,--start-group/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_pgi_thread.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a-Wl,--end-group-mp-lpthread

GNU

-Wl,--start-group/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_gnu_thread.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a-Wl,--end-group-L/opt/intel/Compiler/11.1/038/lib/intel64/-liomp5

Intel-Wl,--start-group/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_thread.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a -Wl,--end-group-openmp –lpthread


• UsetheMKLadvisorylinkline:• http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

Compilationdemos• OnShaheen:Use-Wl,-ysgemm_tocheckwithoptimizedlibraryisused.

– withCray-libsci• ftn –oexe_libsci test_sgemm.f90

– withIntelMKL• Unloadcray-lisci• ftn –oexe_libsci test_sgemm.f90 -Wl,--start-group

${MKLROOT}/lib/intel64/libmkl_intel_lp64.a${MKLROOT}/lib/intel64/libmkl_intel_thread.a${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group-liomp5-lpthread -lm-ldl

• OnIbex– withBLAS

• moduleloadblas/3.7.1/gnu-6.4.0• -gfortran test_sgemm.f90 -lblas

– Intel• moduleloadintel/2017• ifort -oexe_mkl test_sgemm.f90-Wl,--start-group${MKLROOT}/lib/intel64/libmkl_intel_lp64.a

${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group-liomp5-lpthread -lm-ldl

– ACML• moduleloadacml/6.1.0.31-gfortran64• gfortran -oexe_acml test_sgemm.f90-lacml_mp


Pythonfans!• Youcanspeedupyourpythonscripts:

– Byusingthescientificlibrariesnumpy andscipy– Buildpythonwiththevendoroptimizedlibrary

• Availablewithpython/2.7.11onShaheen• Youcanbuildyourownbyfollowingtheinstructions:

– https://software.intel.com/en-us/node/696338• withcray-libsci:

– AvailablenextmonthonShaheen.cray-python/17.09• Checkinstallation:

– importnumpy asnp;– np.show_config()


PythonNumpy checkinstallation


import numpy as np;>>> np.show_config()lapack_opt_info:

libraries = ['mkl_rt', 'pthread']library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']

blas_opt_info:libraries = ['mkl_rt', 'pthread']library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']

openblas_lapack_info:NOT AVAILABLE

lapack_mkl_info:libraries = ['mkl_rt', 'pthread']library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']

blas_mkl_info:libraries = ['mkl_rt', 'pthread']library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']

mkl_info:libraries = ['mkl_rt', 'pthread']library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']

PythonuseNumpy andScipy demo


Performanceformulae• Performanceismeasuredinfloatingpointoperationspersecond,FLOPS,or

FLOP/s.• CurrentprocessorsdeliveranRpeakintheGFLOPS(109 FLOPS)range.• TheRpeak ofasystemcanbecomputedby:

Rpeak =nCPU ·ncore ·nFPU ·f

– nCPU isthenumberofCPUsinthesystem,– ncore isthenumberofcomputingcoresperCPU,– nFPU isthenumberoffloatingpointunitspercore,– fistheclockfrequency


FLOPscountsforrecentprocessormicroarchitectures• IntelCore2andNehalem:

– 4DPFLOPs/cycle:2-wideSSE2addition+2-wideSSE2multiplication

– 8SPFLOPs/cycle:4-wideSSEaddition+4-wideSSEmultiplication

• IntelSandyBridge/IvyBridge:– 8DPFLOPs/cycle:4-wideAVXaddition+4-wide

AVXmultiplication– 16SPFLOPs/cycle:8-wideAVXaddition+8-

wideAVXmultiplication• IntelHaswell:

– 16DPFLOPs/cycle:two4-wideFMA(fusedmultiply-add)instructions

– 32SPFLOPs/cycle:two8-wideFMA(fusedmultiply-add)instructions

• IntelSkylake/Knight’sLandingAVX512– 32flops/cycleDP&– 64flops/cycleSP

• AMDK10:– 4DPFLOPs/cycle:2-wideSSE2addition+2-

wideSSE2multiplication– 8SPFLOPs/cycle:4-wideSSEaddition+4-wide

SSEmultiplication

• AMDBulldozer:– 8DPFLOPs/cycle:4-wideFMA– 16SPFLOPs/cycle:8-wideFMA

• ARMCortex-A15:– 2DPFLOPs/cycle:scalarFMAorscalarmultiply-

add– 8SPFLOPs/cycle:4-wideNEONv2FMAor4-

wideNEONmultiply-add• IBMPowerPCA2(BlueGeneQ):

– 8DPFLOPs/cycle:4-wideQPXFMA– SPelementsareextendedtoDPandprocessed

onthesameunits• IntelMIC(XeonPhi),percore(supports4

hyperthreads):– 16DPFLOPs/cycle:8-wideFMAeverycycle– 32SPFLOPs/cycle:16-wideFMAeverycycle

• IntelMIC(XeonPhi),perthread:– 8DPFLOPs/cycle:8-wideFMAeveryothercycle– 16SPFLOPs/cycle:16-wideFMAeveryother

cycle


Rooflines

• Rooflineisaperformancemodelusedtoboundtheperformanceofvariousnumericalmethodsandoperationsrunningonprocessorarchitectures

Bilel Hadri – Introduction to Numerical Libraries for HPC 33LorenaA.Barba,RioYokota

BestPractices• NumericalRecipesbooksDONOTprovideoptimizedcode.

– (Librariescanbe100xfaster).

• Don’treinventthewheel.• Useoptimizedlibraries!• It’snotonlyforC++/C/Fortran.

– PythonhasaninterfacewithBLAS(checkwithnumpy/scipy)– R,Matlab,cython….

• Don’tforgettheenvironmentvariables!• Theefficientuseofnumericallibrariescanyieldsignificant

performancebenefits– Shouldbeoneofthefirstthingstoinvestigatewhenoptimizingcodes– Thebestlibraryimplementationoftenvariesdependingontheindividualroutine

andpossiblyeventhesizeofinputdata– READthemanualand/orattendthetutorials/workshops!



THANKS

computational scientist kaust supercomputing lab · 2018-01-31 · performance •how fast should...

Documents