
Efficient utilization of computational resources in hybrid clusters
Massimiliano Fatica


Page 1:

Efficient utilization of computational resources in hybrid clusters
Massimiliano Fatica

Page 2:

Overview

• Motivations
• Challenges
  - Data movement
  - Accuracy
• Results
  - Library for DGEMM
  - TeraTF code
• Conclusions

Page 3:

Motivations

• GPUs are very attractive in High Performance Computing:
  - Massive multithreaded many-core chips
  - High flops count (both SP and DP)
  - High memory bandwidth, ECC
  - Programming languages: CUDA C, CUDA Fortran, OpenACC, …
  - Tools: debuggers, profilers, libraries (BLAS, FFT, LAPACK, …)

Page 4:

Motivations

• GPU-accelerated clusters are now a popular configuration:
  - Top500 in June 2012: 58 systems with accelerators (53 NVIDIA, 2 AMD, 2 Cell, 1 Intel MIC)
  - Extremely popular for oil and gas, molecular dynamics, astrophysics

•  For specific workloads/configurations it is desirable to use both CPUs and GPUs

Page 5:

Data Movement

• CPU and GPU have different memory spaces
• The CPU memory system is optimized for latency; the GPU memory system is optimized for throughput
• CPU and GPU are connected by the PCI-e bus: 6 GB/s (gen2), 10 GB/s (gen3)
• Data movement needs to be minimized/hidden:
  - Pinned memory to fully utilize the PCI-e bus
  - Overlap computation and data transfer (a minimal sketch follows below)
• Tesla Fermi GPUs have two DMA engines
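As a minimal sketch of this pattern (not from the talk; `scale` is a hypothetical kernel standing in for real work), pinned host memory plus two CUDA streams let transfers overlap kernel execution:

#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 24, half = n / 2;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned: full PCI-e rate, async-capable
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);

    for (int i = 0; i < 2; i++) {
        // Fermi's two DMA engines allow the H2D copy of one half to
        // overlap the D2H copy of the other, with kernels in between.
        cudaMemcpyAsync(d + i * half, h + i * half, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(half + 255) / 256, 256, 0, s[i]>>>(d + i * half, half, 2.0f);
        cudaMemcpyAsync(h + i * half, d + i * half, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    cudaFreeHost(h); cudaFree(d);
    return 0;
}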

Page 6:

Accuracy

There may be several reasons for different results:
• Different algorithms: serial to parallel (reductions)
• Use of FMA instructions
• Math libraries

“A man with one watch knows what time it is; a man with two watches is never quite sure.” ~ Lee Segall

Page 7:

Computing π

Compute pi in single precision (seed 1234567)
Samples=     10000  Pi=3.16720009  Error= 0.2561E-01
Samples=    100000  Pi=3.13919997  Error= 0.2393E-02
Samples=   1000000  Pi=3.14109206  Error= 0.5007E-03
Samples=  10000000  Pi=3.14106607  Error= 0.5267E-03
  Mismatch between CPU/GPU counts: 78534862 vs. 78534859
Samples= 100000000  Pi=3.14139414  Error= 0.1986E-03

Compute pi in single precision (seed 1234)
Samples=     10000  Pi=3.11120009  Error= 0.3039E-01
Samples=    100000  Pi=3.13632011  Error= 0.5273E-02
Samples=   1000000  Pi=3.14056396  Error= 0.1029E-02
Samples=  10000000  Pi=3.14092445  Error= 0.6683E-03
Samples= 100000000  Pi=3.14158082  Error= 0.1192E-04

Where is the difference coming from?

  if ( (hostData(i)**2   + hostData(i+Nhalf)**2  ) <= 1._fp_kind ) inside_cpu = inside_cpu + 1   (CPU)
  if ( (deviceData(i)**2 + deviceData(i+Nhalf)**2) <= 1._fp_kind ) inside     = inside + 1       (GPU)

- The sum of the points inside the circle is done with integers (no issues due to floating-point arithmetic)
- The computation of the distance from the origin (x*x + y*y) uses no special functions, just + and *

Page 8:

Accuracy: effect of FMA instructions
• Fermi GPUs are IEEE 754 compliant, for both SP and DP
• Support for the fused multiply-add instruction (IEEE 754-2008)
• Results with FMA can differ* from results without FMA
• FMA can be toggled on/off with a compiler switch: extremely useful to compare results against a “golden” CPU output
• FMA will be present in future CPUs

Compute pi in single precision (seed 1234567, FMA disabled)
Samples=     10000  Pi=3.16720009  Error= 0.2561E-01
Samples=    100000  Pi=3.13919997  Error= 0.2393E-02
Samples=   1000000  Pi=3.14109206  Error= 0.5007E-03
Samples=  10000000  Pi=3.14106607  Error= 0.5267E-03
Samples= 100000000  Pi=3.14139462  Error= 0.1981E-03

*Single-precision GPU results with FMA are identical to double-precision CPU results
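One way to see the FMA effect in code (an assumed illustration, not from the slides): nvcc's -fmad flag toggles contraction globally, while the __fmul_rn/__fadd_rn intrinsics force individually rounded operations locally:

// The distance test from the pi example, with and without FMA.
// With FMA, x*x + y*y rounds once for the multiply-add; the intrinsics
// force each operation to round separately, matching a CPU without FMA.
// Globally, compiling with nvcc -fmad=false has the same effect.
__device__ int inside_fma(float x, float y)
{
    return x * x + y * y <= 1.0f;            // may compile to FMA on Fermi
}

__device__ int inside_no_fma(float x, float y)
{
    return __fadd_rn(__fmul_rn(x, x), __fmul_rn(y, y)) <= 1.0f;
}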

Page 9:

Accuracy: effect of math functions

Input: x = -9.841215180935854789368022*2.0

        x^6                                 pow(x,6)                            pow(x,6.0)
gcc     58139640.398606322705745697021484   58139640.398606315255165100097656   58139640.398606315255165100097656
icc     58139640.398606322705745697021484   58139640.398606322705745697021484   58139640.398606322705745697021484
pgcc    58139640.398606322705745697021484   58139640.398606322705745697021484   58139640.398606322705745697021484
CUDA    58139640.398606322705745697021484   58139640.398606322705745697021484   58139640.398606322705745697021484

Different math libraries can give different results: here gcc's pow differs from the other compilers in the trailing digits.

Page 10:

Library for DGEMM

Both CPU cores and GPUs are used in synergy, with minor or no modifications to the original source code:

- A host library intercepts calls to DGEMM and executes them simultaneously on the GPUs and the CPU cores (an illustrative sketch of the interception follows below)
- Pinned memory is used for fast PCI-e transfers (up to 6 GB/s on x16 Gen2, 11 GB/s on x16 Gen3), and computation is overlapped with communication
- The library has been used to accelerate Linpack and Paratec; a similar approach (the phiGemm library) is used in Quantum Espresso
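As an illustration of the interception idea (a hedged sketch, not the talk's actual library: it assumes 'N','N' transposes, a fixed hypothetical split, and omits error checking), a host-side interposer can capture dgemm_ via LD_PRELOAD and split the columns:

// Build (sketch): gcc -shared -fPIC dgemm_split.c -o libdgemm_split.so \
//                 -I/usr/local/cuda/include -lcublas -ldl
// Run:            LD_PRELOAD=./libdgemm_split.so ./app
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cublas.h>

typedef void (*dgemm_t)(const char*, const char*, const int*, const int*, const int*,
                        const double*, const double*, const int*, const double*, const int*,
                        const double*, double*, const int*);

void dgemm_(const char *ta, const char *tb, const int *m, const int *n, const int *k,
            const double *alpha, const double *A, const int *lda,
            const double *B, const int *ldb,
            const double *beta, double *C, const int *ldc)
{
    static dgemm_t real_dgemm;
    if (!real_dgemm) real_dgemm = (dgemm_t)dlsym(RTLD_NEXT, "dgemm_");

    int n_gpu = (int)(*n * 0.80);              /* hypothetical fixed split, eta = 0.80 */
    int n_cpu = *n - n_gpu;

    double *dA, *dB, *dC;
    cublasAlloc((*m) * (*k),  sizeof(double), (void**)&dA);
    cublasAlloc((*k) * n_gpu, sizeof(double), (void**)&dB);
    cublasAlloc((*m) * n_gpu, sizeof(double), (void**)&dC);

    cublasSetMatrix(*m, *k,    sizeof(double), A, *lda, dA, *m);
    cublasSetMatrix(*k, n_gpu, sizeof(double), B, *ldb, dB, *k);
    cublasSetMatrix(*m, n_gpu, sizeof(double), C, *ldc, dC, *m);

    /* Asynchronous: control returns while the GPU works on the first n_gpu columns */
    cublasDgemm('n', 'n', *m, n_gpu, *k, *alpha, dA, *m, dB, *k, *beta, dC, *m);

    /* CPU cores take the remaining n_cpu columns in parallel */
    real_dgemm(ta, tb, m, &n_cpu, k, alpha, A, lda,
               B + (size_t)(*ldb) * n_gpu, ldb, beta,
               C + (size_t)(*ldc) * n_gpu, ldc);

    cublasGetMatrix(*m, n_gpu, sizeof(double), dC, *m, C, *ldc);
    cublasFree(dA); cublasFree(dB); cublasFree(dC);
}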

Page 11:

PCI-e transfer speed

CUDA offers faster PCI-e transfers when host memory is allocated with cudaMallocHost instead of regular malloc.

             SUN Ultra 24 (x16 gen2)      Supermicro 6016GT (x16 gen2)     Sandy Bridge (x16 gen3)
             Pageable      Pinned         Pageable      Pinned             Pageable      Pinned
H2D (MB/s)       2132        5212             4665        5745                 3168       11163
D2H (MB/s)       1882        5471             4064        6059                 2961       10624
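A minimal sketch of the pageable-vs-pinned measurement behind this table (assumed harness, not the talk's benchmark code):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

static float h2d_mbps(void *host, void *dev, size_t bytes)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);
    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return bytes / (ms * 1.0e3f);              // bytes / (ms*1e3) = MB/s (1 MB = 1e6 B)
}

int main(void)
{
    size_t bytes = 64UL << 20;                 // 64 MB payload
    void *dev, *pinned, *pageable = malloc(bytes);
    cudaMalloc(&dev, bytes);
    cudaMallocHost(&pinned, bytes);            // pinned (page-locked) allocation
    printf("pageable H2D: %.0f MB/s\n", h2d_mbps(pageable, dev, bytes));
    printf("pinned   H2D: %.0f MB/s\n", h2d_mbps(pinned, dev, bytes));
    cudaFree(dev); cudaFreeHost(pinned); free(pageable);
    return 0;
}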

Page 12:

DGEMM: C = alpha*A*B + beta*C

DGEMM(A, B, C) = DGEMM(A, B1, C1) ∪ DGEMM(A, B2, C2)

B and C are split by columns: B1 and C1 are processed on the GPU, B2 and C2 on the CPU cores.

The idea can be extended to multi-GPU configurations and to handle huge matrices. Find the optimal split knowing the relative DGEMM performance of the GPU and the CPU cores.

Page 13:

Optimal split

If A is M x K, B is K x N and C is M x N, a DGEMM call performs 2*M*K*N operations.

Split N = N1 + N2 and choose the split so that the GPU and CPU parts finish at the same time:

  T_GPU(M, K, N1) = T_CPU(M, K, N2)

If G_CPU denotes the DGEMM performance of the CPU cores in Gflops and G_GPU that of the GPU, the optimal split is

  η = N1/N = G_GPU / (G_CPU + G_GPU)
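As a concrete check, using the Fermi numbers reported later in the talk (G_GPU ≈ 350 Gflops for a Tesla M2050, G_CPU ≈ 85 Gflops for eight Xeon cores): η = 350 / (85 + 350) ≈ 0.80, so roughly 80% of the columns of B and C go to the GPU.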

Page 14:

Overlap DGEMM on CPU and GPU

// Copy A to the GPU
cublasSetMatrix(m, k, sizeof(A[0]), A, lda, devA, m);
// Copy B1 (first n_gpu columns of B)
cublasSetMatrix(k, n_gpu, sizeof(B[0]), B, ldb, devB, k);
// Copy C1 (first n_gpu columns of C)
cublasSetMatrix(m, n_gpu, sizeof(C[0]), C, ldc, devC, m);
// DGEMM on the GPU; the call is asynchronous, so control returns immediately to the CPU
cublasDgemm('n', 'n', m, n_gpu, k, alpha, devA, m, devB, k, beta, devC, m);
// DGEMM on the CPU cores on the remaining n_cpu columns, overlapping the GPU call
dgemm('n', 'n', m, n_cpu, k, alpha, A, lda, B + ldb*n_gpu, ldb, beta, C + ldc*n_gpu, ldc);
// Copy C1 back (implicitly waits for the GPU DGEMM to finish)
status = cublasGetMatrix(m, n_gpu, sizeof(C[0]), devC, m, C, ldc);

Using CUDA, it is very easy to express the workflow in the diagram.

Page 15:

DGEMM compute/copy analysis

• Assume M, N >> K
• Copy time = 8*(m*k + n*k + 2*m*n) / PCIe ≈ 16*m*n / PCIe   (8 bytes per double; C crosses the bus in both directions)
• Compute time = 2*m*n*k / GFLOPS
• Compute fraction = 1 / (1 + x),  with x = 8*GFLOPS / (K*PCIe)
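Plugging in the Fermi numbers used elsewhere in the talk (GFLOPS ≈ 350, PCIe ≈ 6 GB/s): with K = 1024, x = 8*350 / (1024*6) ≈ 0.46, so the copies take about half as long as the compute and can be fully hidden behind it; with K = 128 instead, x ≈ 3.6 and the transfers dominate no matter how well they are overlapped.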

Page 16:

Fermi DGEMM Strategy

• Slice the matrix into several pieces
• Use the stream API to overlap copy and compute (a sketch follows below):

  Copy A H2D
  Loop over pieces i:
      Copy Bi, Ci H2D
      DGEMM(A, Bi, Ci)
      Copy Ci D2H
  CPU: DGEMM(A, B_last, C_last)
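A hedged sketch of this slicing loop using the legacy cuBLAS API that the earlier slide uses. The preallocated double buffers, the two-stream scheme, and the assumption that the slice width divides n are mine, not the talk's exact code; host B and C must be in pinned memory for the async copies to overlap:

#include <cuda_runtime.h>
#include <cublas.h>

// devB and devC hold two slices each (k*2w and m*2w doubles); devA holds A.
void gpu_dgemm_sliced(int m, int n, int k, double alpha, double beta,
                      const double *A, int lda, const double *B, int ldb,
                      double *C, int ldc,
                      double *devA, double *devB, double *devC,
                      int nslice, cudaStream_t s[2])
{
    cublasSetMatrix(m, k, sizeof(double), A, lda, devA, m);   // A copied once
    int w = n / nslice;                                       // slice width
    for (int i = 0; i < nslice; i++) {
        cudaStream_t st = s[i % 2];        // double buffering: copy of slice i+1
                                           // overlaps the DGEMM of slice i
        const double *Bi = B + (size_t)ldb * i * w;
        double *Ci  = C    + (size_t)ldc * i * w;
        double *dBi = devB + (size_t)k * (i % 2) * w;
        double *dCi = devC + (size_t)m * (i % 2) * w;

        cublasSetMatrixAsync(k, w, sizeof(double), Bi, ldb, dBi, k, st);
        cublasSetMatrixAsync(m, w, sizeof(double), Ci, ldc, dCi, m, st);
        cublasSetKernelStream(st);         // legacy API: bind the DGEMM to this stream
        cublasDgemm('n', 'n', m, w, k, alpha, devA, m, dBi, k, beta, dCi, m);
        cublasGetMatrixAsync(m, w, sizeof(double), dCi, m, Ci, ldc, st);
    }
    cudaDeviceSynchronize();               // the CPU DGEMM on its share runs meanwhile
}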

Page 17:

DGEMM Compute/Copy Overlap

Page 18:

Additional overlap strategy

• The copy of the A matrix can be significant for smaller matrix sizes
• Split A for additional overlap

Page 19:

DGEMM Performance

[Figure: DGEMM performance, GFLOPs (0 to 120) vs. matrix size (128 to 6080), for a quad-core Xeon 2.8 GHz with MKL 10.3, a Tesla C1060 GPU (1.296 GHz), and CPU + GPU combined]

Page 20:

Fermi DGEMM Performance

[Figure: DGEMM performance, GFLOPS (0 to 500) vs. size N = M (K = 1024, N up to 18000), on a dual quad-core Xeon X5550 2.66 GHz (8 cores, MKL 10.2.4.032) with a Tesla M2050 "Fermi" (1.15 GHz): about 85 GFLOPS for the CPU, 350 for the GPU, and 435 for CPU + GPU]

Page 21:

Optimizations: auto split

• Keep track of CPU and GPU performance and adjust the split:
  - wallclock() for CPU time
  - CUDA event records for GPU time
  - Compute the optimal split for the next iteration

  cudaEventRecord(GPU_start, 0);
  loop: launch GPU copies + kernels
  cudaEventRecord(GPU_stop, 0);

  CPU_start = wallclock();
  call CPU_DGEMM
  CPU_stop = wallclock();

  cudaEventSynchronize(GPU_stop);

  GPU_GFLOPS = GPU_FLOPS / GPU_TIME
  CPU_GFLOPS = CPU_FLOPS / CPU_TIME
  SPLIT = GPU_GFLOPS / (GPU_GFLOPS + CPU_GFLOPS)
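A concrete version of this measurement loop (a sketch: wallclock() is assumed to be gettimeofday-based, and the actual GPU and CPU DGEMM calls are stood in by comments):

#include <sys/time.h>
#include <cuda_runtime.h>

static double wallclock(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

double adjust_split(double m, double n, double k, double n_gpu)
{
    cudaEvent_t gpu_start, gpu_stop;
    cudaEventCreate(&gpu_start); cudaEventCreate(&gpu_stop);

    cudaEventRecord(gpu_start, 0);
    /* ... launch GPU copies + DGEMM kernels (asynchronous) ... */
    cudaEventRecord(gpu_stop, 0);

    double cpu_t0 = wallclock();
    /* ... CPU DGEMM on the remaining n - n_gpu columns ... */
    double cpu_time = wallclock() - cpu_t0;

    cudaEventSynchronize(gpu_stop);
    float gpu_ms;
    cudaEventElapsedTime(&gpu_ms, gpu_start, gpu_stop);
    cudaEventDestroy(gpu_start); cudaEventDestroy(gpu_stop);

    double gpu_gflops = 2.0 * m * n_gpu * k / (gpu_ms * 1.0e6);        // ms -> Gflops
    double cpu_gflops = 2.0 * m * (n - n_gpu) * k / (cpu_time * 1.0e9);
    return gpu_gflops / (gpu_gflops + cpu_gflops);   // split for the next iteration
}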

Page 22:

Results on a single node
Dual Intel Xeon X5560 2.8 GHz, 96 GB memory, 2 Tesla M2050

- Peak DP: 89 + 515 * (2) = 604 (1119) GFLOPS
- DGEMM (2/3 of peak on Fermi): 89 + 350 * (2) = 439 (789) GFLOPS

One GPU (P x Q = 1 x 1):
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L2      108032   768     1     1            2011.42              4.179e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0039415 ...... PASSED
================================================================================
417.9 GFLOPS = 69% of "peak" or 95% of "DGEMM"

Two GPUs (P x Q = 1 x 2):
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L2      108032   768     1     2            1192.13              7.051e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0040532 ...... PASSED
================================================================================
705.1 GFLOPS = 63% of "peak" or 89% of "DGEMM"

Page 23:

Results on clusters

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR13C2L4     2359296   768    32   145            6886.10              1.271e+06
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0033806 ...... PASSED
================================================================================

53 systems with Tesla GPUs on the latest Top500 (June 2012)

Several machines with thousands of Tesla GPUs

Page 24:

Results: TeraTF

• TeraTF is a 3D Euler hydrodynamics solver
• 2nd order Godunov-type scheme, 3rd order remapping
• Various Riemann solvers:
  - Exact (used in the SPEC benchmark)
  - Dukowicz (used in a benchmark from CEA)
  - Acoustic
• Fortran 90 with MPI and OpenMP parallelization
• Porting to GPU done with CUDA Fortran
• Part of the SPEC MPI 2007 benchmark

Page 25:

CUDA Fortran

• PGI / NVIDIA collaboration
• Same CUDA programming model as CUDA C
• Program the GPU in Fortran syntax
• Strongly typed: variables declared with the device attribute reside in GPU memory
• Use standard allocate / deallocate
• Copy between CPU and GPU with an assignment statement (GPU_array = CPU_array)
• Copy subsets of arrays with interval notation

Page 26:

Data Layout

• Hydro_vars(vars, i, j, k) = Array of Structs:
  - CPU sequential access = efficient
  - GPU parallel "warp" access = inefficient (uncoalesced)
• Hydro_vars(i, j, k, vars) = Struct of Arrays:
  - GPU parallel "warp" access = efficient (coalesced)
  (see the CUDA C sketch below)
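A sketch of the two layouts in CUDA C terms (struct and kernel names are hypothetical; TeraTF itself is Fortran):

// AoS: hydro[cell].var[v] -> consecutive threads stride by NVARS doubles,
//      so a warp's loads scatter across many memory transactions.
// SoA: var0[cell]         -> consecutive threads read consecutive words,
//      so a warp's loads coalesce into few transactions.
#define NVARS 8

struct CellAoS { double var[NVARS]; };     // Hydro_vars(vars, i, j, k)

__global__ void update_aos(struct CellAoS *h, int ncells)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < ncells)
        h[c].var[0] += 1.0;                // uncoalesced: NVARS*8-byte stride
}

__global__ void update_soa(double *var0, int ncells)   // Hydro_vars(i, j, k, vars)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < ncells)
        var0[c] += 1.0;                    // coalesced: contiguous segment per warp
}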

Page 27:

Data Layout

• Pad leading dimensions:
  - Multiple of the memory transaction granularity
  - Coalescing / alignment
• Equalize leading dimensions:
  - Reduce register pressure / simplify addressing

[Figure: padded Hydro_vars array (i x k planes with plane_stride), comm buffers, cell work and node work regions]

Page 28:

TeraTF results

• 240^3 grid, SOD problem, Dukowicz and Exact Riemann solvers
• 8 MPI processes (2x2x2), 1 GPU per MPI process
• 8 M2050 GPUs, dual quad-core Xeon X5560 (2.66 GHz)
• 4 GPUs share an x16 PCIe link (not optimal)

Page 29:

Accuracy of GPU and CPU results

• Dukowicz: GPU and CPU results are identical
• Exact: difference in the last digit (pow?)

                      CPU                  GPU
START
  Mass                0.1082530240533624   0.1082530240533624
  Total energy        0.2646185032453942   0.2646185032453942
END (Dukowicz)
  Mass                0.1082530240536995   0.1082530240536995
  Total energy        0.2646185032454266   0.2646185032454266
END (Exact)
  Mass                0.1082530240536390   0.1082530240536390
  Total energy        0.2646185032458129   0.2646185032458130

Page 30:

TeraTF Performance

[Figure: TeraTF GPU speed-ups of 15x and 19x over the CPU]

Page 31:

TeraTF Performance

Medium case of the SPEC MPI configuration (240^3, Exact Riemann solver) on a Cray XK6:

MPI tasks   GPUs   Time (s)   Speed-up
1x1x1          1        691        1.0
2x2x2          8        112        6.1
3x3x3         27         42       16.4
4x4x4         64         24       28.7
5x5x5        125         16       42.3

Page 32:

Hybrid approach

• Partition the local domain across GPU and CPU (OpenMP); a sketch follows below

[Figure: local domain split between GPU and CPU, with different partitions for the X and Y phase and for the Z phase]
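A hedged sketch of this partition in CUDA C (names are hypothetical stand-ins for TeraTF's solvers; compile with nvcc -Xcompiler -fopenmp): the GPU takes the first slab of the local domain and works asynchronously while OpenMP threads update the rest:

#include <cuda_runtime.h>

__global__ void update_gpu(double *u, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] += 1.0;                    // stand-in for the hydro update
}

void hybrid_step(double *dev_u, double *host_u, int n_local, double frac_gpu)
{
    int n_gpu = (int)(n_local * frac_gpu);     // GPU slab

    // Kernel launches are asynchronous: the CPU continues immediately
    update_gpu<<<(n_gpu + 255) / 256, 256>>>(dev_u, n_gpu);

    // OpenMP threads update the CPU slab while the GPU works
    #pragma omp parallel for
    for (int i = n_gpu; i < n_local; i++)
        host_u[i] += 1.0;

    cudaDeviceSynchronize();                   // join before the halo exchange
}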

Page 33:

TeraTF Hybrid Performance

Configuration   Speed-up
1 CPU core          1.0
8 OMP               3.96
8 MPI               6.4
GPU                21
GPU + 8 OMP        27

Page 34:

Conclusions

• It is possible to fully utilize the computational resources available on hybrid clusters
• With the right software design, PCI-e transfer time can be hidden
• Improved power efficiency