Towards Auto-tuning Facilities into Supercomputers in Operation
- The FIBER approach and minimizing software-stack requirements -
Takahiro Katagiri (片桐 孝洋), Information Technology Center, The University of Tokyo (東京大学 情報基盤センター)
2014 ATAT in HPSC, National Taiwan University, March 15, 2014 (Saturday), Performance session, 10:10-10:30
Joint work with: Satoshi Ohshima (大島 聡史) and Masaharu Matsumoto (松本 正晴)
Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
Background: High-Thread Parallelism (HTP)
◦ Multi-core and many-core processors are pervasive.
 Multi-core CPUs: 8-16 cores, 16-64 threads with Hyper-Threading (HT) or Simultaneous Multithreading (SMT).
 Many-core CPU: Xeon Phi, 60 cores, 240 threads with HT.
◦ Utilizing all available threads is important.
Performance Portability (PP)
◦ Maintaining high performance across multiple computing environments.
 Not only multiple CPUs, but also multiple compilers.
 Run-time information, such as loop lengths and the number of threads, is important.
◦ Auto-tuning (AT) is one candidate technology for establishing PP across multiple computing environments.
ppOpen-HPC Project: Middleware for HPC and Its AT
◦ Supported by JST CREST, from FY2011 to FY2016.
◦ PI: Professor Kengo Nakajima (U. Tokyo)
ppOpen-HPC
◦ An open-source infrastructure for reliable simulation codes on post-peta (pp) scale parallel computers.
◦ Consists of various types of libraries, which cover five kinds of discretization methods for scientific computations.
ppOpen-AT
◦ An auto-tuning language for ppOpen-HPC codes.
◦ Uses knowledge from a previous project, the ABCLibScript Project.
◦ An auto-tuning language based on AT directives.
[Figure: Software Architecture of ppOpen-HPC. The user's program sits on top of ppOpen-APPL (FEM, FDM, FVM, BEM, DEM), ppOpen-MATH (MG, GRAPH, VIS, MP), ppOpen-AT (STATIC, DYNAMIC), and ppOpen-SYS (COMM, FT), targeting many-core CPUs, GPUs, low-power CPUs, and vector CPUs. The auto-tuning facility performs code generation for optimization candidates, searches for the best candidate, and automatically executes the optimization; the resource allocation facility specifies the best execution allocations.]
Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
Overview of FIBER (Framework of Install-time, Before Execute-time and Run-time auto-tuning) [T. Katagiri et al., 2003]
[Diagram: the FIBER auto-tuning workflow. Legacy codes annotated with AT directives (#pragma oat ...) are processed by a preprocessor of the AT directives, producing legacy codes with AT functions and with the AT candidates (#implementation1, #implementation2, #implementation3) specified by the directives. After compiling, the executable codes with AT functions are driven by the FIBER auto-tuner, which records results in a performance database and determines the best parameters. FIBER defines three AT timings, specified by the AT directives: install-time, before execute-time, and run-time. The user specifies parameters through the API on FIBER.]
A Scenario for Software Developers Using ppOpen-AT
[Diagram: the software developer writes the description of AT using ppOpen-AT, targeting optimizations that cannot be established by compilers. Invoking the dedicated preprocessor turns this program with AT functions into executable code with the optimization candidates and the AT function.]
#pragma oat install unroll (i,j,k) region start
#pragma oat varied (i,j,k) from 1 to 8
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
      A[i][j] = A[i][j] + B[i][k] * C[k][j];
    }
  }
}
#pragma oat install unroll (i,j,k) region end
■ Automatically generated functions: optimization candidates, performance monitor, parameter search, performance modeling.
■ Described by the software developer: optimizations for source codes, computer resources, and power consumption.
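For illustration, the following is a minimal sketch of the kind of optimization candidate such a preprocessor could generate from the region above. It is written in Fortran to match the kernels discussed later in this talk; the unroll depth of 2 is just one point of the 1-to-8 range being varied, and the subroutine name, declarations, and remainder handling are assumptions of this sketch rather than actual ppOpen-AT output.

! Hypothetical generated candidate: the j loop unrolled by 2,
! with an explicit remainder loop for odd n.
subroutine matmul_unroll_j2(A, B, C, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: A(n, n)
  real, intent(in)    :: B(n, n), C(n, n)
  integer :: i, j, k

  do i = 1, n
    do j = 1, n - mod(n, 2), 2
      do k = 1, n
        A(i, j)   = A(i, j)   + B(i, k) * C(k, j)
        A(i, j+1) = A(i, j+1) + B(i, k) * C(k, j+1)
      end do
    end do
    do j = n - mod(n, 2) + 1, n      ! remainder column when n is odd
      do k = 1, n
        A(i, j) = A(i, j) + B(i, k) * C(k, j)
      end do
    end do
  end do
end subroutine matmul_unroll_j2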
Compiler Optimization and AT
1. Loop lengths are unknown at compile time. The optimal loop split and loop fusion are determined at run time; run-time compilation is still only a research topic.
2. Loop splitting with data dependencies. Some loop splits require additional computation or memory space. Some compilers provide a directive for this, but the directive is not standardized, and code optimization is not standardized across compilers either.
3. Restrictions from the operation of supercomputers. Some supercomputer environments cannot supply the required "software stack", or the software stack cannot be used due to operational restrictions, or the tool is out of scope for the system due to hardware restrictions. Ex.) CAPS on the K computer. Operation costs (budgets), vendor strategies, etc.
Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
EARLY EXPERIENCE WITH AN EXPLICIT METHOD (FINITE DIFFERENCE METHOD)
Target Application: Seism3D
◦ Simulation software for seismic wave analysis.
◦ Strategic simulation software in Japan.
◦ Developed by Professor Furumura at the University of Tokyo.
◦ The code has been re-constructed as ppOpen-APPL/FDM.
◦ Finite Difference Method (FDM)
◦ 3D simulation: 3D arrays are allocated.
◦ Data type: single precision (real*4)
Source: http://www.eri.u-tokyo.ac.jp/furumura/tsunami/tsunami.html
The Heaviest Loop (20%+ of Total Time)
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
 DO J = 1, NY
  DO I = 1, NX
   RL1 = LAM(I,J,K); RM1 = RIG(I,J,K)
   RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
   DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
   D3V3 = DXVX1 + DYVY1 + DZVZ1
   SXX(I,J,K) = SXX(I,J,K) + (RLRM2*(D3V3) - RM2*(DZVZ1+DYVY1)) * DT
   SYY(I,J,K) = SYY(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DZVZ1)) * DT
   SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DYVY1)) * DT
   DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
   DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
   DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
   SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
   SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
   SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
  END DO
 END DO
END DO
!$omp end parallel do
A flow dependency (through RM1) links the first half of the loop body to the second half.
Optimization Possibilities
Loop splitting
◦ To reduce spill code.
◦ To maximize register usage.
Loop fusion (loop collapse)
◦ Three nested loops -> one of the following two forms.
◦ A single collapsed loop:
 To increase outer-loop parallelism for thread parallelism.
◦ A doubly nested loop:
 To increase outer-loop parallelism for thread parallelism.
 To utilize prefetching in the inner loop.
Loop Fusion: One-Dimensional (a loop collapse)
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY * NX
 K = (KK-1)/(NY*NX) + 1
 J = mod((KK-1)/NX, NY) + 1
 I = mod(KK-1, NX) + 1
 RL1 = LAM(I,J,K); RM1 = RIG(I,J,K)
 RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
 DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
 D3V3 = DXVX1 + DYVY1 + DZVZ1
 SXX(I,J,K) = SXX(I,J,K) + (RLRM2*(D3V3) - RM2*(DZVZ1+DYVY1)) * DT
 SYY(I,J,K) = SYY(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DZVZ1)) * DT
 SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DYVY1)) * DT
 DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
 DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
 DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
 SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
 SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
 SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
END DO
!$omp end parallel do
Merit: the loop length is huge, which is good for OpenMP thread parallelism.
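The index recovery above (K, J, and I from the collapsed index KK) relies on KK enumerating the iterations with I fastest. The following self-contained Fortran sketch, with arbitrary illustrative problem sizes, confirms that the formulas invert that linearization:

program check_collapse
  implicit none
  integer, parameter :: NX = 4, NY = 3, NZ = 2   ! illustrative sizes only
  integer :: i, j, k, kk, mismatches
  mismatches = 0
  do kk = 1, NZ*NY*NX
     k = (kk-1)/(NY*NX) + 1        ! slowest index
     j = mod((kk-1)/NX, NY) + 1
     i = mod(kk-1, NX) + 1         ! fastest index
     ! KK must equal the linearization of (K,J,I) with I fastest
     if (kk /= (k-1)*NY*NX + (j-1)*NX + i) mismatches = mismatches + 1
  end do
  print *, 'mismatches:', mismatches   ! expected output: 0
end program check_collapse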
Loop Fusion: Two-Dimensional
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY
 K = (KK-1)/NY + 1
 J = mod(KK-1, NY) + 1
 DO I = 1, NX
  RL1 = LAM(I,J,K); RM1 = RIG(I,J,K)
  RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
  DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
  D3V3 = DXVX1 + DYVY1 + DZVZ1
  SXX(I,J,K) = SXX(I,J,K) + (RLRM2*(D3V3) - RM2*(DZVZ1+DYVY1)) * DT
  SYY(I,J,K) = SYY(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DZVZ1)) * DT
  SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DYVY1)) * DT
  DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
  DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
  DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
  SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
  SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
  SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
 END DO
END DO
!$omp end parallel do
Merit: the loop length is still huge, which is good for OpenMP thread parallelism, and the remaining contiguous I loop gives the hardware an opportunity for prefetching.
Perfect Splitting: Two 3-Nested Loops

!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
 DO J = 1, NY
  DO I = 1, NX
   RL1 = LAM(I,J,K); RM1 = RIG(I,J,K)
   RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
   DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
   D3V3 = DXVX1 + DYVY1 + DZVZ1
   SXX(I,J,K) = SXX(I,J,K) + (RLRM2*(D3V3) - RM2*(DZVZ1+DYVY1)) * DT
   SYY(I,J,K) = SYY(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DZVZ1)) * DT
   SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DYVY1)) * DT
  END DO
  DO I = 1, NX
   RM1 = RIG(I,J,K)
   DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
   DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
   DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
   SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
   SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
   SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
  END DO
 END DO
END DO
!$omp end parallel do

Re-computation (a copy) of RM1 is needed in the second loop.
⇒ Compilers do not apply this splitting without a directive.
New Directives for ppOpen-AT
• m_stress.f90 (ppohFDM_update_stress)

!OAT$ install LoopFusionSplit region start
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
do k = NZ00, NZ01
 do j = NY00, NY01
  do i = NX00, NX01
   RL1 = LAM(I,J,K)
!OAT$ SplitPointCopyDef sub region start
   RM1 = RIG(I,J,K)
!OAT$ SplitPointCopyDef sub region end
   RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
   DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
   D3V3 = DXVX1 + DYVY1 + DZVZ1
   SXX(I,J,K) = SXX(I,J,K) + (RLRM2*(D3V3) - RM2*(DZVZ1+DYVY1)) * DT
   SYY(I,J,K) = SYY(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DZVZ1)) * DT
   SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DYVY1)) * DT
!OAT$ SplitPoint (K,J,I)
!OAT$ SplitPointCopyInsert
   DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
   DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
   DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
   SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
   SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
   SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
  end do
 end do
end do
!$omp end parallel do
!OAT$ install LoopFusionSplit region end

SplitPointCopyDef marks the statement to be re-calculated (copied); SplitPointCopyInsert marks where the re-calculated copy is used; SplitPoint (K,J,I) marks the loop split point. When a split at this point is selected, the preprocessor inserts the copied definition of RM1 at the head of the second loop, reproducing the split form shown on the previous slide.
Candidates of Auto-Generated Codes
#1 [Baseline]: the original three-nested loop.
#2 [Split]: loop split at the k-loop (two separate three-nested loops).
#3 [Split]: loop split at the j-loop.
#4 [Split]: loop split at the i-loop.
#5 [Fusion]: loop fusion of the k-loop and j-loop (a two-nested loop).
#6 [Split and Fusion]: loop fusion of the k-loop and j-loop applied to each of the loops in #2.
#7 [Fusion]: loop fusion of the k-loop, j-loop, and i-loop (loop collapse).
#8 [Split and Fusion]: loop fusion of the k-loop, j-loop, and i-loop applied to each of the loops in #2 (loop collapse of the two separated loops).
(A sketch of how the auto-tuner can search over these candidates is shown below.)
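At before execute-time, the auto-tuner simply measures each generated candidate and keeps the fastest one. The Fortran sketch below illustrates such a brute-force search; the module, the run_candidate dispatcher, and the use of omp_get_wtime are assumptions of this sketch, not the API that ppOpen-AT actually generates.

module at_search
  use omp_lib, only : omp_get_wtime
  implicit none
contains
  subroutine run_candidate(ic)
    ! Stand-in for the generated kernels #1..#8; a real ppOpen-AT build
    ! would call the corresponding update_stress variant here.
    integer, intent(in) :: ic
  end subroutine run_candidate

  subroutine search_best_candidate(best_id)
    integer, intent(out) :: best_id
    integer, parameter :: n_cand = 8, n_rep = 100
    integer :: ic, ir
    double precision :: t0, t, t_best
    t_best = huge(1.0d0)
    best_id = 0
    do ic = 1, n_cand
       t0 = omp_get_wtime()
       do ir = 1, n_rep              ! repeat to stabilize the measurement
          call run_candidate(ic)
       end do
       t = omp_get_wtime() - t0
       if (t < t_best) then          ! keep the fastest candidate so far
          t_best = t
          best_id = ic
       end if
    end do
  end subroutine search_best_candidate
end module at_search

The selected best_id corresponds to the kind of "Best SW" number reported in the evaluation section.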
Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
PERFORMANCE EVALUATION
An Example of a Seism3D Simulation
◦ The western Tottori prefecture earthquake in Japan in the year 2000 ([1], p. 14).
◦ A region of 820 km x 410 km x 128 km is discretized with 0.4 km spacing: NX x NY x NZ = 2050 x 1025 x 320 (a ratio of roughly 6.4 : 3.2 : 1).
[1] T. Furumura, "Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking", Supercomputing News, Information Technology Center, The University of Tokyo, Vol. 11, Special Edition 1, 2009. In Japanese.
Figure: Seismic wave propagation in the western Tottori prefecture earthquake. (a) Measured waves; (b) Simulation results. (Reference: [1], p. 13)
Test Conditions
Software versions
◦ ppOpen-APPL/FDM version 0.2
◦ ppOpen-AT version 0.2
Target kernels in ppOpen-APPL/FDM
◦ Top 10 kernels (all three-nested loops): Update_stress, Update_vel, Update_spong, and the other 7 kernels of the finite difference computation.
AT timing
◦ Before execute-time: after the user fixes the problem size and the number of threads, AT is applied at the time the library routine is called.
◦ All AT candidates are evaluated (brute-force search): only 8 + 3 + 6 + 7 x 3 = 38 candidates.
Number of repetitions for each kernel in AT mode
◦ 100 times
The Xeon Phi Cluster System
Intel Xeon (Ivy Bridge): host CPU
◦ OS: Red Hat Enterprise Linux Server release 6.2
◦ #Nodes: 32 (available: 14 nodes)
◦ CPU: Intel Xeon E5-2670 v2 @ 2.50 GHz, 2 sockets x 10 cores
◦ Hyper-Threading: ON
◦ Theoretical peak performance per CPU node: 400 GFLOPS (= 2.50 GHz x 8 flops x 20 cores)
◦ Memory size per node: 64 GB
◦ Interconnect: InfiniBand
◦ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
◦ Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel
◦ KMP_AFFINITY=granularity=fine,compact (threads are packed onto one socket)
Intel Xeon Phi co-processor (Xeon Phi): accelerator
◦ CPU: Xeon Phi 5110P (B1 stepping), 1.053 GHz, 60 cores
◦ Memory size: 8 GB
◦ Theoretical peak performance: 1 TFLOPS (= 1.053 GHz x 16 flops x 60 cores)
◦ One board connected to each node of the cluster
◦ Native mode
◦ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
◦ Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel -mmic -align array64byte
◦ KMP_AFFINITY=granularity=fine,balanced (threads are distributed evenly over the cores)
RESULTS ON THE XEON PHI
Execution Details
• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• Target problem size
– NX x NY x NZ = 256 x 96 x 100 per node
– NX x NY x NZ = 32 x 16 x 20 per core (not per MPI process)
• Native mode for the MIC
• Target MPI processes and threads on the Xeon Phi
– 1 node of the Xeon Phi with 4-way HT (Hyper-Threading)
– PXTY: X MPI processes and Y threads per process
– P240T1: pure MPI with 4 HT per core
– P120T2
– P60T4
– P16T15
– P8T30: the minimum hybrid MPI-OpenMP execution for ppOpen-APPL/FDM, since it needs at least 8 MPI processes
• Number of iterations for the kernels: 100
AT Effect (update_stress, Xeon Phi); execution time in seconds.
New kernels; KMP_AFFINITY=balanced, -align array64byte.

             P240T1  P120T2  P60T4  P16T15  P8T30
Without AT    2.11    2.32    2.33   2.96    3.14
With AT       1.29    1.70    1.74   1.91    1.97
Speedup       1.63    1.36    1.34   1.55    1.59
Best SW       #6      #5      #5     #5      #6
Conclusion
◦ Loop fusion to obtain high parallelism is one of the key techniques for current multi-core and many-core architectures.
 - Execution with 240 threads / MPI processes on the Xeon Phi.
 - Strong scaling with more than 10,000 cores on the FX10.
◦ To apply AT on supercomputers in operation, minimizing the required software stack is a practical way to establish AT.
ppOpen-AT is free software!
ppOpen-AT version 0.2 is available!
The license is MIT.
Please access the following page:
http://ppopenhpc.cc.u-tokyo.ac.jp/
Thank you for your attention!
Questions?