
Copyright INTERTWinE Consortium 2018

TAU-kernel linsolv (GASPI+OpenMP)

Introduction

TAU-kernel linsolv implements several (iterative) methods to find an approximate solution of the linear system $Ax = b$, where $A$ is a sparse block matrix of dimension $N$ (number of grid points), with square blocks $A_{ij}$ of dimension $V$ (usually $V = 6$).
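For illustration only, one natural storage scheme for such a matrix is a block compressed sparse row (BSR) layout with dense $V \times V$ blocks; the structure below is a hypothetical sketch (the type and field names block_csr_t, row_ptr, col_idx and blocks are ours, not taken from the TAU sources):

```c
/* Hypothetical block-CSR (BSR) layout for a sparse N x N block matrix
 * with dense V x V blocks; names are illustrative, not from TAU. */
#define V 6                     /* block dimension (usually 6 in TAU)        */

typedef struct {
    int     n;                  /* number of block rows (grid points N)      */
    int    *row_ptr;            /* size n+1: start of each block row         */
    int    *col_idx;            /* size nnzb: block column index j           */
    double *blocks;             /* size nnzb*V*V: dense blocks A_ij,         */
                                /* block k stored row-major at blocks+k*V*V  */
} block_csr_t;

/* y += A_ij * x_j for one V x V block (row-major block storage). */
static void block_gemv_add(const double *a_ij, const double *x_j, double *y)
{
    for (int r = 0; r < V; ++r)
        for (int c = 0; c < V; ++c)
            y[r] += a_ij[r * V + c] * x_j[c];
}
```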

Available solution methods are:

• (PNT) Point implicit: $x_i = A_{ii}^{-1} b_i$, $i = 1, \dots, N$

• (JAC) Jacobi: $x_i^{(k+1)} = A_{ii}^{-1}\Big(b_i - \sum_{j=1,\, j \neq i}^{N} A_{ij}\, x_j^{(k)}\Big)$, $i = 1, \dots, N$

• (GS) Gauss-Seidel: $x_i^{(k+1)} = A_{ii}^{-1}\Big(b_i - \sum_{j=1}^{i-1} A_{ij}\, x_j^{(k+1)} - \sum_{j=i+1}^{N} A_{ij}\, x_j^{(k)}\Big)$, $i = 1, \dots, N$

• (SGS) Symmetric Gauss-Seidel: the same update as Gauss-Seidel, applied in a forward sweep followed by a backward sweep, i.e. $i = 1, \dots, N, N, \dots, 1$

As a preliminary step for all methods, the LU decomposition of the diagonal blocks of $A$ is calculated (LU) and used for the solution of the small $V \times V$ systems with the diagonal blocks $A_{ii}$.
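As a concrete illustration of one sweep, the fragment below sketches a block-Jacobi update $x_i^{(k+1)} = A_{ii}^{-1}\big(b_i - \sum_{j \neq i} A_{ij}\, x_j^{(k)}\big)$ on top of the hypothetical block_csr_t layout from above, using LAPACKE for the small $V \times V$ solves with precomputed LU factors of the diagonal blocks. This is a minimal sketch under these assumptions, not the actual TAU kernel (which need not use LAPACK).

```c
/* Minimal sketch of one block-Jacobi sweep on the hypothetical block_csr_t
 * layout; diag_lu/ipiv hold the precomputed LU factors of the diagonal
 * blocks A_ii (from LAPACKE_dgetrf). Illustrative only, not the TAU kernel. */
#include <string.h>
#include <lapacke.h>

void jacobi_sweep(const block_csr_t *A,
                  const double     *diag_lu,  /* n*V*V: LU factors of A_ii */
                  const lapack_int *ipiv,     /* n*V: pivot indices        */
                  const double *b, const double *x_old, double *x_new)
{
    #pragma omp parallel for                   /* block rows are independent */
    for (int i = 0; i < A->n; ++i) {
        double r[V];
        memcpy(r, &b[i * V], sizeof r);        /* r = b_i                    */

        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k) {
            int j = A->col_idx[k];
            if (j == i) continue;              /* skip the diagonal block    */
            for (int p = 0; p < V; ++p)        /* r -= A_ij * x_old_j        */
                for (int q = 0; q < V; ++q)
                    r[p] -= A->blocks[(size_t)k * V * V + p * V + q]
                            * x_old[j * V + q];
        }

        /* x_new_i = A_ii^{-1} r via the precomputed LU factors of A_ii */
        memcpy(&x_new[i * V], r, sizeof r);
        LAPACKE_dgetrs(LAPACK_ROW_MAJOR, 'N', V, 1,
                       &diag_lu[(size_t)i * V * V], V,
                       &ipiv[i * V], &x_new[i * V], 1);
    }
}
```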

In TAU these methods are used to construct a preconditioner for the Runge-Kutta scheme, which does not require an exact solution of the linear system; usually only a few iterations $k$ are performed.

Motivation

By using a hybrid parallelization approach we hope to achieve better scalability, due to reduced communication, as well as better load-balancing capabilities.

Implementation details and performance results

GASPI parallelization

The parallelization is based on a domain decomposition of the computational grid. Matrix $A$ is decomposed row-wise according to the mapping of grid points to subdomains. Each subdomain is assigned to one GASPI process and is first solved as an individual problem. For JAC and GS, the halo part of the approximate solution is then communicated after each completed sweep; for SGS, after each forward and backward sweep. For LU and PNT, only local data is involved, so no communication is needed.
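To make the communication pattern concrete, the fragment below sketches how the halo part of $x$ could be exchanged once per sweep with GPI-2 write-notify operations. The neighbor and offset bookkeeping (nbr, send_off, recv_off, halo_len, the segment id SEG_X) is hypothetical and not taken from the kernel; only the GASPI calls themselves follow the GPI-2 API.

```c
/* Hedged sketch of a per-sweep halo exchange with GPI-2 write-notify.
 * The bookkeeping arrays are hypothetical; x is assumed to live in
 * GASPI segment SEG_X on every rank. */
#include <GASPI.h>

#define SEG_X ((gaspi_segment_id_t)0)   /* segment holding the solution x */

void exchange_halo(int num_nbrs, const gaspi_rank_t *nbr,
                   const gaspi_offset_t *send_off,  /* my data for nbr p    */
                   const gaspi_offset_t *recv_off,  /* its halo slot for me */
                   const gaspi_size_t   *halo_len,  /* bytes per neighbor   */
                   gaspi_rank_t my_rank)
{
    const gaspi_queue_id_t q = 0;

    /* post one write_notify per neighbor; the notification id encodes
     * the sending rank, the value 1 just signals arrival */
    for (int p = 0; p < num_nbrs; ++p)
        gaspi_write_notify(SEG_X, send_off[p], nbr[p],
                           SEG_X, recv_off[p], halo_len[p],
                           (gaspi_notification_id_t)my_rank, 1,
                           q, GASPI_BLOCK);

    /* wait until the halo data from every neighbor has arrived */
    for (int p = 0; p < num_nbrs; ++p) {
        gaspi_notification_id_t first;
        gaspi_notification_t    val;
        gaspi_notify_waitsome(SEG_X, nbr[p], 1, &first, GASPI_BLOCK);
        gaspi_notify_reset(SEG_X, first, &val);
    }

    /* flush the queue so the local send regions may be reused */
    gaspi_wait(q, GASPI_BLOCK);
}
```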

OpenMP parallelization

For LU, JAC and PNT, all iterations $i = 1, \dots, N$ are completely independent and are therefore parallelized using #pragma omp for directives. For GS and SGS, iterations over rows $i$ have been manually split into nthread consecutive chunks (i.e. local subdomains). During a sweep each thread then calculates and updates only its own part of $x$, using the other parts of $x$ from the previous sweep. All threads participate in GASPI communication.
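A minimal sketch of this manual splitting is shown below; the chunking arithmetic and the update_row callback are our own illustration (reusing the hypothetical block_csr_t type from above), and only the use of omp_get_num_threads/omp_get_thread_num mirrors the description:

```c
/* Hedged sketch of the GS/SGS row loop split into one consecutive chunk per
 * thread: each thread updates only its own part of x and reads the other
 * parts as left by the previous sweep. Not the actual TAU implementation. */
#include <omp.h>

void gs_sweep_chunked(const block_csr_t *A, const double *b, double *x,
                      void (*update_row)(const block_csr_t *, const double *,
                                         double *, int))
{
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int tid      = omp_get_thread_num();

        /* consecutive chunk [lo, hi) of block rows owned by this thread */
        int chunk = (A->n + nthreads - 1) / nthreads;
        int lo    = tid * chunk;
        int hi    = (lo + chunk < A->n) ? lo + chunk : A->n;

        /* Gauss-Seidel inside the chunk: for j in [lo, i) the new values of
         * x are used, everywhere else the values from the previous sweep */
        for (int i = lo; i < hi; ++i)
            update_row(A, b, x, i);  /* x_i = A_ii^{-1}(b_i - sum_j A_ij x_j) */
    }   /* implicit barrier: the sweep is complete before communication */
}
```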


Figure 1 and Figure 2 show performance results for the DLR-F6half test case (~$15.8 \cdot 10^6$ grid points) conducted on the DLR CASE-2 HPC cluster, equipped with Ivy Bridge processors (Intel Xeon E5-2695 v2, dual-socket, 12 cores per socket, 2.4 GHz) connected with InfiniBand FDR. Intel Parallel Studio XE 2017 (compiler, MPI library and OpenMP runtime) and GPI-2 (Feb 14, 2017) were used for the tests.

Figure 1: Parallel speedup compared to 5 nodes. Panels: a) pure MPI (24 processes per node); b) GASPI + OpenMP (6 threads per process, 4 processes per node).

Figure 2: Relative performance of GASPI + OpenMP (6 threads per process, 4 processes per node) compared to pure MPI (24 processes per node).

For the methods involving no communication (LU and PNT), the hybrid version generally performs slightly better than the original pure MPI version. For the methods involving communication (JAC, GS and SGS), the hybrid version does not reach the performance of the original version on 1 to 45 nodes, but outperforms it at larger node counts.