Towards Auto-tuning Facilities into Supercomputers in Operation
- The FIBER approach and minimizing software-stack requirements -
Takahiro Katagiri (片桐 孝洋), Information Technology Center, The University of Tokyo (東京大学 情報基盤センター)
2014 ATAT in HPSC, National Taiwan University, March 15, 2014 (Saturday), Performance session, 10:10-10:30
Joint work with: Satoshi Ohshima (大島 聡史) and Masaharu Matsumoto (松本 正晴)
Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
Background: High-Thread Parallelism (HTP)
◦ Multi-core and many-core processors are pervasive.
 Multi-core CPUs: 8-16 cores, 16-64 threads with Hyper-Threading (HT) or Simultaneous Multithreading (SMT).
 Many-core CPU: Xeon Phi, 60 cores, 240 threads with HT.
◦ Utilizing all available threads is important.
Performance Portability (PP)
◦ Maintaining high performance across multiple computing environments.
 Not only multiple CPUs, but also multiple compilers.
 Run-time information, such as loop lengths and the number of threads, is important.
◦ Auto-tuning (AT) is one candidate technology for establishing PP across multiple computing environments.
ppOpen-HPC Project: Middleware for HPC and Its AT
◦ Supported by JST CREST, from FY2011 to FY2016.
◦ PI: Professor Kengo Nakajima (U. Tokyo)
ppOpen-HPC
◦ An open-source infrastructure for reliable simulation codes on post-peta (pp) scale parallel computers.
◦ Consists of various types of libraries, which cover five kinds of discretization methods for scientific computations.
ppOpen-AT
◦ An auto-tuning language for ppOpen-HPC codes.
◦ Uses knowledge from a previous project, the ABCLibScript Project.
◦ An auto-tuning language based on AT directives.
[Figure: Software Architecture of ppOpen-HPC. The user's program sits on top of ppOpen-APPL (FEM, FDM, FVM, BEM, DEM), ppOpen-MATH (MG, GRAPH, VIS, MP), ppOpen-AT (STATIC, DYNAMIC), and ppOpen-SYS (COMM, FT), targeting many-core CPUs, GPUs, low-power CPUs, and vector CPUs. The auto-tuning facility performs code generation for optimization candidates, searches for the best candidate, and automatically executes the optimization; the resource allocation facility specifies the best execution allocations.]
Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
Overview of FIBER (Framework of Install-time, Before Execute-time and Run-time auto-tuning) [T. Katagiri et al., 2003]
[Diagram: the FIBER auto-tuning workflow. Legacy codes annotated with AT directives (#pragma oat ...) are processed by a preprocessor of the AT directives, producing legacy codes with AT functions and with the AT candidates (#implementation1, #implementation2, #implementation3) specified by the directives. After compiling, the executable codes with AT functions are driven by the FIBER auto-tuner, which records results in a performance database and determines the best parameters. FIBER defines three AT timings, specified by the AT directives: install-time, before execute-time, and run-time. The user specifies parameters through the API on FIBER.]
A Scenario for Software Developers Using ppOpen-AT
[Diagram: the software developer writes the description of AT using ppOpen-AT, targeting optimizations that cannot be established by compilers. Invoking the dedicated preprocessor turns this program with AT functions into executable code with the optimization candidates and the AT function.]
#pragma oat install unroll (i,j,k) region start
#pragma oat varied (i,j,k) from 1 to 8
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
      A[i][j] = A[i][j] + B[i][k] * C[k][j];
    }
  }
}
#pragma oat install unroll (i,j,k) region end
■ Automatically generated functions: optimization candidates, performance monitor, parameter search, performance modeling.
■ Described by the software developer: optimizations for source codes, computer resources, and power consumption.
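For illustration, the following is a minimal sketch of the kind of optimization candidate such a preprocessor could generate from the region above. It is written in Fortran to match the kernels discussed later in this talk; the unroll depth of 2 is just one point of the 1-to-8 range being varied, and the subroutine name, declarations, and remainder handling are assumptions of this sketch rather than actual ppOpen-AT output.

! Hypothetical generated candidate: the j loop unrolled by 2,
! with an explicit remainder loop for odd n.
subroutine matmul_unroll_j2(A, B, C, n)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: A(n, n)
  real, intent(in)    :: B(n, n), C(n, n)
  integer :: i, j, k

  do i = 1, n
    do j = 1, n - mod(n, 2), 2
      do k = 1, n
        A(i, j)   = A(i, j)   + B(i, k) * C(k, j)
        A(i, j+1) = A(i, j+1) + B(i, k) * C(k, j+1)
      end do
    end do
    do j = n - mod(n, 2) + 1, n      ! remainder column when n is odd
      do k = 1, n
        A(i, j) = A(i, j) + B(i, k) * C(k, j)
      end do
    end do
  end do
end subroutine matmul_unroll_j2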
Compiler Optimization and AT
1. Loop lengths are unknown at compile time. The optimal loop split and loop fusion are determined at run time; run-time compilation is still only a research topic.
2. Loop splitting with data dependencies. Some loop splits require additional computation or memory space. Some compilers provide a directive for this, but the directive is not standardized, and code optimization is not standardized across compilers either.
3. Restrictions from the operation of supercomputers. Some supercomputer environments cannot supply the required "software stack", or the software stack cannot be used due to operational restrictions, or the tool is out of scope for the system due to hardware restrictions. Ex.) CAPS on the K computer. Operation costs (budgets), vendor strategies, etc.
Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
EARLY EXPERIENCE WITH AN EXPLICIT METHOD (FINITE DIFFERENCE METHOD)
Target Application: Seism3D
◦ Simulation software for seismic wave analysis.
◦ Strategic simulation software in Japan.
◦ Developed by Professor Furumura at the University of Tokyo.
◦ The code has been re-constructed as ppOpen-APPL/FDM.
◦ Finite Difference Method (FDM)
◦ 3D simulation: 3D arrays are allocated.
◦ Data type: single precision (real*4)
Source: http://www.eri.u-tokyo.ac.jp/furumura/tsunami/tsunami.html
The Heaviest Loop (20%+ of Total Time)
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
 DO J = 1, NY
  DO I = 1, NX
   RL1 = LAM(I,J,K); RM1 = RIG(I,J,K)
   RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
   DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
   D3V3 = DXVX1 + DYVY1 + DZVZ1
   SXX(I,J,K) = SXX(I,J,K) + (RLRM2*(D3V3) - RM2*(DZVZ1+DYVY1)) * DT
   SYY(I,J,K) = SYY(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DZVZ1)) * DT
   SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DYVY1)) * DT
   DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
   DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
   DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
   SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
   SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
   SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
  END DO
 END DO
END DO
!$omp end parallel do
A flow dependency (through RM1) links the first half of the loop body to the second half.
Optimization Possibilities
Loop splitting
◦ To reduce spill code.
◦ To maximize register usage.
Loop fusion (loop collapse)
◦ Three nested loops -> one of the following two forms.
◦ A single collapsed loop:
 To increase outer-loop parallelism for thread parallelism.
◦ A doubly nested loop:
 To increase outer-loop parallelism for thread parallelism.
 To utilize prefetching in the inner loop.
Loop Fusion: One-Dimensional (a loop collapse)
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY * NX
 K = (KK-1)/(NY*NX) + 1
 J = mod((KK-1)/NX, NY) + 1
 I = mod(KK-1, NX) + 1
 RL1 = LAM(I,J,K); RM1 = RIG(I,J,K)
 RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
 DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
 D3V3 = DXVX1 + DYVY1 + DZVZ1
 SXX(I,J,K) = SXX(I,J,K) + (RLRM2*(D3V3) - RM2*(DZVZ1+DYVY1)) * DT
 SYY(I,J,K) = SYY(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DZVZ1)) * DT
 SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DYVY1)) * DT
 DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
 DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
 DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
 SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
 SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
 SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
END DO
!$omp end parallel do
Merit: the loop length is huge, which is good for OpenMP thread parallelism.
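The index recovery above (K, J, and I from the collapsed index KK) relies on KK enumerating the iterations with I fastest. The following self-contained Fortran sketch, with arbitrary illustrative problem sizes, confirms that the formulas invert that linearization:

program check_collapse
  implicit none
  integer, parameter :: NX = 4, NY = 3, NZ = 2   ! illustrative sizes only
  integer :: i, j, k, kk, mismatches
  mismatches = 0
  do kk = 1, NZ*NY*NX
     k = (kk-1)/(NY*NX) + 1        ! slowest index
     j = mod((kk-1)/NX, NY) + 1
     i = mod(kk-1, NX) + 1         ! fastest index
     ! KK must equal the linearization of (K,J,I) with I fastest
     if (kk /= (k-1)*NY*NX + (j-1)*NX + i) mismatches = mismatches + 1
  end do
  print *, 'mismatches:', mismatches   ! expected output: 0
end program check_collapse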
Loop Fusion: Two-Dimensional
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO KK = 1, NZ * NY
 K = (KK-1)/NY + 1
 J = mod(KK-1, NY) + 1
 DO I = 1, NX
  RL1 = LAM(I,J,K); RM1 = RIG(I,J,K)
  RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
  DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
  D3V3 = DXVX1 + DYVY1 + DZVZ1
  SXX(I,J,K) = SXX(I,J,K) + (RLRM2*(D3V3) - RM2*(DZVZ1+DYVY1)) * DT
  SYY(I,J,K) = SYY(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DZVZ1)) * DT
  SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DYVY1)) * DT
  DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
  DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
  DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
  SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
  SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
  SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
 END DO
END DO
!$omp end parallel do
Merit: the loop length is still huge, which is good for OpenMP thread parallelism, and the remaining contiguous I loop gives the hardware an opportunity for prefetching.
Perfect Splitting: Two 3-Nested Loops

!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
DO K = 1, NZ
 DO J = 1, NY
  DO I = 1, NX
   RL1 = LAM(I,J,K); RM1 = RIG(I,J,K)
   RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
   DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
   D3V3 = DXVX1 + DYVY1 + DZVZ1
   SXX(I,J,K) = SXX(I,J,K) + (RLRM2*(D3V3) - RM2*(DZVZ1+DYVY1)) * DT
   SYY(I,J,K) = SYY(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DZVZ1)) * DT
   SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DYVY1)) * DT
  END DO
  DO I = 1, NX
   RM1 = RIG(I,J,K)
   DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
   DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
   DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
   SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
   SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
   SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
  END DO
 END DO
END DO
!$omp end parallel do

Re-computation (a copy) of RM1 is needed in the second loop.
⇒ Compilers do not apply this splitting without a directive.
New Directives for ppOpen-AT
• m_stress.f90 (ppohFDM_update_stress)

!OAT$ install LoopFusionSplit region start
!$omp parallel do private(k,j,i,RL1,RM1,RM2,RLRM2,DXVX1,DYVY1,DZVZ1,D3V3,DXVYDYVX1,DXVZDZVX1,DYVZDZVY1)
do k = NZ00, NZ01
 do j = NY00, NY01
  do i = NX00, NX01
   RL1 = LAM(I,J,K)
!OAT$ SplitPointCopyDef sub region start
   RM1 = RIG(I,J,K)
!OAT$ SplitPointCopyDef sub region end
   RM2 = RM1 + RM1; RLRM2 = RL1 + RM2
   DXVX1 = DXVX(I,J,K); DYVY1 = DYVY(I,J,K); DZVZ1 = DZVZ(I,J,K)
   D3V3 = DXVX1 + DYVY1 + DZVZ1
   SXX(I,J,K) = SXX(I,J,K) + (RLRM2*(D3V3) - RM2*(DZVZ1+DYVY1)) * DT
   SYY(I,J,K) = SYY(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DZVZ1)) * DT
   SZZ(I,J,K) = SZZ(I,J,K) + (RLRM2*(D3V3) - RM2*(DXVX1+DYVY1)) * DT
!OAT$ SplitPoint (K,J,I)
!OAT$ SplitPointCopyInsert
   DXVYDYVX1 = DXVY(I,J,K) + DYVX(I,J,K)
   DXVZDZVX1 = DXVZ(I,J,K) + DZVX(I,J,K)
   DYVZDZVY1 = DYVZ(I,J,K) + DZVY(I,J,K)
   SXY(I,J,K) = SXY(I,J,K) + RM1 * DXVYDYVX1 * DT
   SXZ(I,J,K) = SXZ(I,J,K) + RM1 * DXVZDZVX1 * DT
   SYZ(I,J,K) = SYZ(I,J,K) + RM1 * DYVZDZVY1 * DT
  end do
 end do
end do
!$omp end parallel do
!OAT$ install LoopFusionSplit region end

SplitPointCopyDef marks the statement to be re-calculated (copied); SplitPointCopyInsert marks where the re-calculated copy is used; SplitPoint (K,J,I) marks the loop split point. When a split at this point is selected, the preprocessor inserts the copied definition of RM1 at the head of the second loop, reproducing the split form shown on the previous slide.
Candidates of Auto-Generated Codes
#1 [Baseline]: the original three-nested loop.
#2 [Split]: loop split at the k-loop (two separate three-nested loops).
#3 [Split]: loop split at the j-loop.
#4 [Split]: loop split at the i-loop.
#5 [Fusion]: loop fusion of the k-loop and j-loop (a two-nested loop).
#6 [Split and Fusion]: loop fusion of the k-loop and j-loop applied to each of the loops in #2.
#7 [Fusion]: loop fusion of the k-loop, j-loop, and i-loop (loop collapse).
#8 [Split and Fusion]: loop fusion of the k-loop, j-loop, and i-loop applied to each of the loops in #2 (loop collapse of the two separated loops).
(A sketch of how the auto-tuner can search over these candidates is shown below.)
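At before execute-time, the auto-tuner simply measures each generated candidate and keeps the fastest one. The Fortran sketch below illustrates such a brute-force search; the module, the run_candidate dispatcher, and the use of omp_get_wtime are assumptions of this sketch, not the API that ppOpen-AT actually generates.

module at_search
  use omp_lib, only : omp_get_wtime
  implicit none
contains
  subroutine run_candidate(ic)
    ! Stand-in for the generated kernels #1..#8; a real ppOpen-AT build
    ! would call the corresponding update_stress variant here.
    integer, intent(in) :: ic
  end subroutine run_candidate

  subroutine search_best_candidate(best_id)
    integer, intent(out) :: best_id
    integer, parameter :: n_cand = 8, n_rep = 100
    integer :: ic, ir
    double precision :: t0, t, t_best
    t_best = huge(1.0d0)
    best_id = 0
    do ic = 1, n_cand
       t0 = omp_get_wtime()
       do ir = 1, n_rep              ! repeat to stabilize the measurement
          call run_candidate(ic)
       end do
       t = omp_get_wtime() - t0
       if (t < t_best) then          ! keep the fastest candidate so far
          t_best = t
          best_id = ic
       end if
    end do
  end subroutine search_best_candidate
end module at_search

The selected best_id corresponds to the kind of "Best SW" number reported in the evaluation section.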
Overview
1. Background and ppOpen-HPC Project
2. ppOpen-AT Basics
3. Adaptation to an FDM Application
4. Performance Evaluation
5. Conclusion
PERFORMANCE EVALUATION
An Example of a Seism3D Simulation
◦ The western Tottori prefecture earthquake in Japan in the year 2000 ([1], p. 14).
◦ A region of 820 km x 410 km x 128 km is discretized with 0.4 km spacing: NX x NY x NZ = 2050 x 1025 x 320 (a ratio of roughly 6.4 : 3.2 : 1).
[1] T. Furumura, "Large-scale Parallel FDM Simulation for Seismic Waves and Strong Shaking", Supercomputing News, Information Technology Center, The University of Tokyo, Vol. 11, Special Edition 1, 2009. In Japanese.
Figure: Seismic wave propagation in the western Tottori prefecture earthquake. (a) Measured waves; (b) Simulation results. (Reference: [1], p. 13)
Test Conditions
Software versions
◦ ppOpen-APPL/FDM version 0.2
◦ ppOpen-AT version 0.2
Target kernels in ppOpen-APPL/FDM
◦ Top 10 kernels (all three-nested loops): Update_stress, Update_vel, Update_spong, and the other 7 kernels of the finite difference computation.
AT timing
◦ Before execute-time: after the user fixes the problem size and the number of threads, AT is applied at the time the library routine is called.
◦ All AT candidates are evaluated (brute-force search): only 8 + 3 + 6 + 7 x 3 = 38 candidates.
Number of repetitions for each kernel in AT mode
◦ 100 times
The Xeon Phi Cluster System
Intel Xeon (Ivy Bridge): host CPU
◦ OS: Red Hat Enterprise Linux Server release 6.2
◦ #Nodes: 32 (available: 14 nodes)
◦ CPU: Intel Xeon E5-2670 v2 @ 2.50 GHz, 2 sockets x 10 cores
◦ Hyper-Threading: ON
◦ Theoretical peak performance per CPU node: 400 GFLOPS (= 2.50 GHz x 8 flops x 20 cores)
◦ Memory size per node: 64 GB
◦ Interconnect: InfiniBand
◦ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
◦ Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel
◦ KMP_AFFINITY=granularity=fine,compact (threads are packed onto one socket)
Intel Xeon Phi co-processor (Xeon Phi): accelerator
◦ CPU: Xeon Phi 5110P (B1 stepping), 1.053 GHz, 60 cores
◦ Memory size: 8 GB
◦ Theoretical peak performance: 1 TFLOPS (= 1.053 GHz x 16 flops x 60 cores)
◦ One board connected to each node of the cluster
◦ Native mode
◦ Compiler: Intel Fortran version 14.0.0.080 Build 20130728
◦ Compiler options: -ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel -mmic -align array64byte
◦ KMP_AFFINITY=granularity=fine,balanced (threads are distributed evenly over the cores)
RESULTS ON THE XEON PHI
Execution Details
• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• Target problem size
– NX x NY x NZ = 256 x 96 x 100 per node
– NX x NY x NZ = 32 x 16 x 20 per core (not per MPI process)
• Native mode for the MIC
• Target MPI processes and threads on the Xeon Phi
– 1 node of the Xeon Phi with 4-way HT (Hyper-Threading)
– PXTY: X MPI processes and Y threads per process
– P240T1: pure MPI with 4 HT per core
– P120T2
– P60T4
– P16T15
– P8T30: the minimum hybrid MPI-OpenMP execution for ppOpen-APPL/FDM, since it needs at least 8 MPI processes
• Number of iterations for the kernels: 100
AT Effect (update_stress, Xeon Phi); execution time in seconds.
New kernels; KMP_AFFINITY=balanced, -align array64byte.

             P240T1  P120T2  P60T4  P16T15  P8T30
Without AT    2.11    2.32    2.33   2.96    3.14
With AT       1.29    1.70    1.74   1.91    1.97
Speedup       1.63    1.36    1.34   1.55    1.59
Best SW       #6      #5      #5     #5      #6
Conclusion
◦ Loop fusion to obtain high parallelism is one of the key techniques for current multi-core and many-core architectures.
 - Execution with 240 threads / MPI processes on the Xeon Phi.
 - Strong scaling with more than 10,000 cores on the FX10.
◦ To apply AT on supercomputers in operation, minimizing the required software stack is a practical way to establish AT.
ppOpen-AT is free software!
ppOpen-AT version 0.2 is available!
The license is MIT.
Please access the following page:
http://ppopenhpc.cc.u-tokyo.ac.jp/
Thank you for your attention!
Questions?