computational scientist kaust supercomputing lab · 2018-01-31 · performance •how fast should...
TRANSCRIPT
Introduction to Numerical Libraries for HPC
ComputationalScientistKAUSTSupercomputingLab
Bilel Hadri 1
NumericalLibraries– ApplicationAreas
• Mostusedlibraries/softwareinHPC!• LinearAlgebra
– Systemsofequations• Direct,Iterative,Multigrid solvers• Sparse,Densesystem
– Eigenvalueproblems– Leastsquares
• Signalprocessing– FFT
• NumericalIntegration• RandomNumberGenerators
Bilel Hadri – Introduction to Numerical Libraries for HPC 2
NumericalLibraries- Motivations
• Don’tReinventtheWheel!• Improvesproductivity!• Getabetterperformance!
– Fasterandbetteralgorithms
Bilel Hadri – Introduction to Numerical Libraries for HPC 3
Faster (Better Code)• Achieving best performance requires creating very
processor- and system-specific code• Example: Dense matrix-matrix multiply
– Simple to express: do i=1, n
do j=1,ndo k=1,n
c(i,j) = c(i,j) + a(i,k) * b(k,j)
enddo
enddo
enddo
Bilel Hadri – Introduction to Numerical Libraries for HPC 4
Performance• How fast should this run?
– Our matrix-matrix multiply algorithm has 2n3 floating point operations• 3 nested loops, each with n iterations• 1 multiply, 1 add in each inner iteration
– For n=100, 2x106 operations, about 1 msec on a 2GHz processor– For n=1000, 2x109 operations, or about 1 sec
• Reality:– N=100 1.1ms– N=1000 6sà Obvious expression of algorithms are not transformed into leading performance.
Bilel Hadri – Introduction to Numerical Libraries for HPC 5
NumericalLibraries– Packages• LinearAlgebra
– BLAS/LAPACK/ScaLAPACK /PLASMA/MAGMA/HiCMA– PETSc– HYPRE– TRILINOS
• Signalprocessing– FFTW
• NumericalIntegration– GSL
• RandomNumberGenerators– SPRNG
Bilel Hadri – Introduction to Numerical Libraries for HPC 6
Others
Bilel Hadri – Introduction to Numerical Libraries for HPC 7
MUMPS4.9.2.MUMPS(MUltifrontal MassivelyParallelsparsedirectSolver)isapackageofparallel,sparse,directlinearsystemsolversbasedonamultifrontalalgorithm.http://graal.enslyon.fr/MUMPS/
SuperLU 4.3.SuperLU isasequentialversionofSuperLU_dist andasequentialincompleteLUpreconditioner thatcanacceleratetheconvergenceofKrylovsubspaceiterativesolvers.http://crd.lbl.gov/~xiaoye/SuperLU/
ParMETIS 4.0.2.ParMETIS (ParallelGraphPartitioningandFillreducingMatrixOrdering)isalibraryofroutinesthatpartitionunstructuredgraphsandmeshesandcomputefillreducing orderingsofsparsematrices.http://glaros.dtc.umn.edu/gkhome/views/metis/
Bilel Hadri – Introduction to Numerical Libraries for HPC 8
• SUNDIALS 2.5.0 (SUite of Nonlinear and DIfferential/Algebraic equationSolvers) consists of 5 solvers: CVODE, CVODES, IDA, IDAS, and KINSOL. Inaddition, SUNDIALS provides a MATLAB interface to CVODES, IDAS, andKINSOL that is called sundialsTB.
https://computation.llnl.gov/casc/sundials/main.html
• Scotch 5.1.12b Scotch is a software package and libraries for sequentialand parallel graph partitioning, static mapping, sparse matrix blockordering, and sequential mesh and hypergraph partitioning.
http://www.labri.fr/perso/pelegrin/scotch
Note: On Shaheen , they are all grouped into cray-tpsl
• FreelyAvailableSoftwareforLinearAlgebrahttp://www.netlib.org/utk/people/JackDongarra/la-sw.html
BLAS(BasicLinearAlgebraSubprograms)
Bilel Hadri – Introduction to Numerical Libraries for HPC 9
The BLAS functionality is divided into three levels: • Level 1: contains vector operations of the form:
• Level 2: contains matrix-vector operations of the form:
• Level 3: contains matrix-matrix operations of the form:
• Several implementations for different languages exist– Reference implementation http://www.netlib.org/blas/
BLAS:namingconventions• Eachroutinehasanamewhichspecifiestheoperation,thetypeofmatricesinvolved
andtheirprecisions.Namesareintheform:PMMOO”.
Bilel Hadri – Introduction to Numerical Libraries for HPC 10
– Each operation is defined for four precisions (P) • S single real
D double real C single complex Z double complex
– The types of matrices are (MM) • GE general
GB general band SY symmetric SB symmetric band SP symmetric packed HE hermitianHB hermitian band HP hermitian packed TR triangular TB triangular band TP triangular packed
– Some of the most common operations (OO):• DOT scalar product, x^T y
AXPY vector sum, α x + y MV matrix-vector product, A x SV matrix-vector solve, inv(A) x MM matrix-matrix product, A B SM matrix-matrix solve, inv(A) B
• ExamplesSGEMM stands for “single-precision general matrix-matrix multiply”DGEMM stands for “double-precision matrix-matrix multiply”.
BLASLevel1routines• Vectoroperations
(xROT,xSWAP,xCOPYetc.)• Scalar dotproducts
(xDOTetc.)• Vectornorms
(IxAMXetc.)
Bilel Hadri – Introduction to Numerical Libraries for HPC 11
BLASLevel2routines
• Matrix-vectoroperations(xGEMV,xGBMV,xHEMV,xHBMV etc.)
• SolvingTx =y forx,whereTistriangular(xGER,xHER etc.)
Bilel Hadri – Introduction to Numerical Libraries for HPC 12
BLASLevel3routines• Matrix-matrixoperations(xGEMM etc.)• Solvingfortriangularmatrices(xTRMM)• Widelyusedmatrix-matrixmultiply(xSYMM,xGEMM)
Bilel Hadri – Introduction to Numerical Libraries for HPC 13
LAPACK(LinearAlgebraPACKage)• LinearAlgebraPACKage
– http://www.netlib.org/lapack/– Providesroutinesfor
• Solvingsystemsofsimultaneouslinearequations,• Least-squaressolutionsoflinearsystemsofequations,• Eigenvalue problems,• HouseholdertransformationtoimplementQRdecompositiononamatrixand• Singularvalueproblems
– Wasinitiallydesignedtorunefficientlyonsharedmemoryvector machines– DependsonBLAS– HasbeenextendedfordistributedsystemsScaLAPACK(ScalableLinear
AlgebraPACKage)
Bilel Hadri – Introduction to Numerical Libraries for HPC 14
LAPACKnamingconventions• VerysimilartoBLAS
– XYYZZZ• X:datatype
– S:REAL– D:DOUBLEPRECISION– C:COMPLEX– Z:COMPLEX*16orDOUBLECOMPLEX
• YY:matrixtype– BD:bidiagonal– DI:diagonal– GB:generalband– GE:general(i.e.,unsymmetric,insomecases
rectangular)– GG:generalmatrices,generalizedproblem
(i.e.,apairofgeneralmatrices)– GT:generaltridiagonal– HB:(complex)Hermitian band– HE:(complex)Hermitian– HG:upperHessenberg matrix,generalized
problem(i.e aHessenberg anda triangularmatrix)
– HP:(complex)Hermitian,packedstorage– HS:upperHessenberg– OP:(real)orthogonal,packedstorage– OR:(real)orthogonal– PB:symmetricorHermitian positivedefinite
band
• YY:morematrixtypes– PO:symmetricorHermitian positivedefinite– PP:symmetricorHermitian positivedefinite,
packedstorage– PT:symmetricorHermitian positivedefinite
tridiagonal– SB:(real)symmetricband– SP:symmetric,packedstorage– ST:(real)symmetrictridiagonal– SY:symmetric– TB:triangularband– TG:triangularmatrices,generalizedproblem
(i.e.,apairoftriangularmatrices)– TP:triangular,packedstorage– TR:triangular(orinsomecasesquasi-
triangular)– TZ:trapezoidal– UN:(complex)unitary– UP:(complex)unitary,packedstorage
• ZZZ:performedcomputation– Linearsystems– Factorizations– Eigenvalueproblems– Singularvaluedecomposition– Etc.
Bilel Hadri – Introduction to Numerical Libraries for HPC 15
LAPACKroutines
Bilel Hadri – Introduction to Numerical Libraries for HPC 16
http://www.icl.utk.edu/~mgates3/docs/lapack-summary.pdf
NumericalLibrariespackages
• VendorlibrariesoptimizedimplementationsofBLAS,LAPACK,ScaLAPACK totheirprocessorsandtheirplatform.
Bilel Hadri – Introduction to Numerical Libraries for HPC 17
Dongarra/ICL
LAPACK&ScaLAPACK
• ScaLAPACK isalibrarywithasubsetofLAPACKroutinesrunningondistributedmemorymachines.
• ScaLAPACK isdesignedforheterogeneouscomputingandispotableonanycomputerthatsupportsMPIorPVM.
• http://www.netlib.org/scalapack/scalapack_home.html
Bilel Hadri – Introduction to Numerical Libraries for HPC 18
Bilel Hadri – Introduction to Numerical Libraries for HPC 20
WhyuseLAPACKorScaLAPACK?
• Solvingsystemsof:– Linearequations:– Leastsquares:– Eigenvalue problem:– Singularvalueproblem:
2||||min bAx-
bAx =
TVUA S=
𝐴𝑥 = 𝜆𝑥
ReferenceBLASvs Tuned• ThereferenceBLASandLAPACKlibrariesarereferenceimplementationsoftheBLASand
LAPACKstandard.Thesearenotoptimisedandnotmulti-threaded,sonotmuchperformanceshouldbeexpected.Theselibrariesareavailablefordownloadhttp://www.netlib.org/blas andhttp://www.netlib.org/lapack
• TheAutomaticallyTunedLinearAlgebraSoftware,ATLAS.Duringcompiletime,ATLASautomaticallychosesthealgorithmsdeliveringthebestperformance.ATLASdoesnotcontainallLAPACKfunctionality;itcanbedownloadedfromhttp://www.netlib.org/atlas
• TheGotoBLASanimplementationofthelevel3BLASaimedathighefficiency].TheGotoBLASisavailablefordownloadfromhttp://www.tacc.utexas.edu/resources/software
Bilel Hadri – Introduction to Numerical Libraries for HPC 21
OptimizedvendorlibrariesforBLAS/LAPACK
• Highlyefficientversions• Handtunedassemblybyhardwarevendors• Providenearpeakperformance• Severalvendorsprovidelibrariesoptimizedfortheirarchitecture(AMD,Fujitsu,IBM,Intel,NEC,…)– IntelàMKL– Crayà LibSci– AMDà ACML– IBMà ESSL
• USEthem!(Speedupupto10ormore)Bilel Hadri – Introduction to Numerical Libraries for HPC 22
AMD/MKL• ACML(AMDCoreMathLibrary)
– LAPACK,BLAS,andextendedBLAS(sparse),FFTs(single- anddouble-precision,realandcomplexdatatypes).
– APIsforbothFortranandC– https://developer.amd.com/amd-open64-software-development-
kit/building-with-acml/
• MKL(MathKernelLibrary)– LAPACK,BLAS,andextendedBLAS(sparse),FFTs(single- anddouble-precision,
realandcomplexdatatypes).– APIsforbothFortranandC– www.intel.com/software/products/mkl/– UsetheMKLadvisorypagetolinkyourcodewithit:
https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
Bilel Hadri – Introduction to Numerical Libraries for HPC
ExamplewithSGEMM
Bilel Hadri – Introduction to Numerical Libraries for HPC 24http://homepage.ntu.edu.tw/~wttsai/fortran/
Fortranexample
Bilel Hadri – Introduction to Numerical Libraries for HPC 25
Sourceavailableathttp://homepage.ntu.edu.tw/~wttsai/fortran/
LinkingexamplesLibrary Compiler Link-flags
LIBSCIonCray
Cray Environmentbydefault:compilewithoutaddingflagsGNU compilewithoutaddingflagsIntel compilewithoutaddingflags
ACMLGNU /opt/acml/4.4.0/gfortran64_mp/lib/libacml_mp.a –fopenmpIntel /opt/acml/4.4.0/ifort64_mp/lib/libacml_mp.a-openmp–lpthread
MKL
PGI-Wl,--start-group/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_pgi_thread.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a-Wl,--end-group-mp-lpthread
GNU
-Wl,--start-group/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_gnu_thread.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a-Wl,--end-group-L/opt/intel/Compiler/11.1/038/lib/intel64/-liomp5
Intel-Wl,--start-group/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_lp64.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_intel_thread.a/opt/intel/Compiler/11.1/038/mkl/lib/em64t/libmkl_core.a -Wl,--end-group-openmp –lpthread
Bilel Hadri – Introduction to Numerical Libraries for HPC 26
• UsetheMKLadvisorylinkline:• http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
Compilationdemos• OnShaheen:Use-Wl,-ysgemm_tocheckwithoptimizedlibraryisused.
– withCray-libsci• ftn –oexe_libsci test_sgemm.f90
– withIntelMKL• Unloadcray-lisci• ftn –oexe_libsci test_sgemm.f90 -Wl,--start-group
${MKLROOT}/lib/intel64/libmkl_intel_lp64.a${MKLROOT}/lib/intel64/libmkl_intel_thread.a${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group-liomp5-lpthread -lm-ldl
• OnIbex– withBLAS
• moduleloadblas/3.7.1/gnu-6.4.0• -gfortran test_sgemm.f90 -lblas
– Intel• moduleloadintel/2017• ifort -oexe_mkl test_sgemm.f90-Wl,--start-group${MKLROOT}/lib/intel64/libmkl_intel_lp64.a
${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group-liomp5-lpthread -lm-ldl
– ACML• moduleloadacml/6.1.0.31-gfortran64• gfortran -oexe_acml test_sgemm.f90-lacml_mp
Bilel Hadri – Introduction to Numerical Libraries for HPC 27
Pythonfans!• Youcanspeedupyourpythonscripts:
– Byusingthescientificlibrariesnumpy andscipy– Buildpythonwiththevendoroptimizedlibrary
• Availablewithpython/2.7.11onShaheen• Youcanbuildyourownbyfollowingtheinstructions:
– https://software.intel.com/en-us/node/696338• withcray-libsci:
– AvailablenextmonthonShaheen.cray-python/17.09• Checkinstallation:
– importnumpy asnp;– np.show_config()
Bilel Hadri – Introduction to Numerical Libraries for HPC 28
PythonNumpy checkinstallation
Bilel Hadri – Introduction to Numerical Libraries for HPC 29
import numpy as np;>>> np.show_config()lapack_opt_info:
libraries = ['mkl_rt', 'pthread']library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
blas_opt_info:libraries = ['mkl_rt', 'pthread']library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
openblas_lapack_info:NOT AVAILABLE
lapack_mkl_info:libraries = ['mkl_rt', 'pthread']library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
blas_mkl_info:libraries = ['mkl_rt', 'pthread']library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
mkl_info:libraries = ['mkl_rt', 'pthread']library_dirs = ['/opt/intel/composer_xe_2015.2.164/mkl/lib/intel64']define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]include_dirs = ['/opt/intel/compilers_and_libraries_2016/linux/mkl/include']
Performanceformulae• Performanceismeasuredinfloatingpointoperationspersecond,FLOPS,or
FLOP/s.• CurrentprocessorsdeliveranRpeakintheGFLOPS(109 FLOPS)range.• TheRpeak ofasystemcanbecomputedby:
Rpeak =nCPU ·ncore ·nFPU ·f
– nCPU isthenumberofCPUsinthesystem,– ncore isthenumberofcomputingcoresperCPU,– nFPU isthenumberoffloatingpointunitspercore,– fistheclockfrequency
Bilel Hadri – Introduction to Numerical Libraries for HPC 31
FLOPscountsforrecentprocessormicroarchitectures• IntelCore2andNehalem:
– 4DPFLOPs/cycle:2-wideSSE2addition+2-wideSSE2multiplication
– 8SPFLOPs/cycle:4-wideSSEaddition+4-wideSSEmultiplication
• IntelSandyBridge/IvyBridge:– 8DPFLOPs/cycle:4-wideAVXaddition+4-wide
AVXmultiplication– 16SPFLOPs/cycle:8-wideAVXaddition+8-
wideAVXmultiplication• IntelHaswell:
– 16DPFLOPs/cycle:two4-wideFMA(fusedmultiply-add)instructions
– 32SPFLOPs/cycle:two8-wideFMA(fusedmultiply-add)instructions
• IntelSkylake/Knight’sLandingAVX512– 32flops/cycleDP&– 64flops/cycleSP
• AMDK10:– 4DPFLOPs/cycle:2-wideSSE2addition+2-
wideSSE2multiplication– 8SPFLOPs/cycle:4-wideSSEaddition+4-wide
SSEmultiplication
• AMDBulldozer:– 8DPFLOPs/cycle:4-wideFMA– 16SPFLOPs/cycle:8-wideFMA
• ARMCortex-A15:– 2DPFLOPs/cycle:scalarFMAorscalarmultiply-
add– 8SPFLOPs/cycle:4-wideNEONv2FMAor4-
wideNEONmultiply-add• IBMPowerPCA2(BlueGeneQ):
– 8DPFLOPs/cycle:4-wideQPXFMA– SPelementsareextendedtoDPandprocessed
onthesameunits• IntelMIC(XeonPhi),percore(supports4
hyperthreads):– 16DPFLOPs/cycle:8-wideFMAeverycycle– 32SPFLOPs/cycle:16-wideFMAeverycycle
• IntelMIC(XeonPhi),perthread:– 8DPFLOPs/cycle:8-wideFMAeveryothercycle– 16SPFLOPs/cycle:16-wideFMAeveryother
cycle
Bilel Hadri – Introduction to Numerical Libraries for HPC 32
Rooflines
• Rooflineisaperformancemodelusedtoboundtheperformanceofvariousnumericalmethodsandoperationsrunningonprocessorarchitectures
Bilel Hadri – Introduction to Numerical Libraries for HPC 33LorenaA.Barba,RioYokota
BestPractices• NumericalRecipesbooksDONOTprovideoptimizedcode.
– (Librariescanbe100xfaster).
• Don’treinventthewheel.• Useoptimizedlibraries!• It’snotonlyforC++/C/Fortran.
– PythonhasaninterfacewithBLAS(checkwithnumpy/scipy)– R,Matlab,cython….
• Don’tforgettheenvironmentvariables!• Theefficientuseofnumericallibrariescanyieldsignificant
performancebenefits– Shouldbeoneofthefirstthingstoinvestigatewhenoptimizingcodes– Thebestlibraryimplementationoftenvariesdependingontheindividualroutine
andpossiblyeventhesizeofinputdata– READthemanualand/orattendthetutorials/workshops!
Bilel Hadri – Introduction to Numerical Libraries for HPC 34