
Efficient utilization of computational resources in hybrid clusters
Massimiliano Fatica


Page 1:

Efficient utilization of computational resources in hybrid clusters
Massimiliano Fatica

Page 2:

Overview

• Motivations
• Challenges
  - Data movement
  - Accuracy
• Results
  - Library for DGEMM
  - TeraTF code
• Conclusions

Page 3:

Motivations

• GPUs are very attractive in High Performance Computing:
  - Massive multithreaded many-core chips
  - High flops count (both SP and DP)
  - High memory bandwidth, ECC
  - Programming languages: CUDA C, CUDA Fortran, OpenACC, …
  - Tools: debuggers, profilers, libraries (BLAS, FFT, LAPACK, …)

Page 4:

Motivations

• GPU-accelerated clusters are now a popular configuration:
  - Top500 in June 2012: 58 systems with accelerators (53 NVIDIA, 2 AMD, 2 Cell, 1 Intel MIC)
  - Extremely popular for oil and gas, molecular dynamics, astrophysics

•  For specific workloads/configurations it is desirable to use both CPUs and GPUs

Page 5:

Data Movement

• CPU and GPU have different memory spaces
• The CPU memory system is optimized for latency; the GPU memory system is optimized for throughput
• CPU and GPU are connected by the PCI-e bus: 6 GB/s (gen2), 10 GB/s (gen3)
• Data movement needs to be minimized/hidden:
  - Pinned memory to fully utilize the PCI-e bus
  - Overlap computation and data transfer (a minimal sketch follows below)
• Tesla Fermi GPUs have two DMA engines
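As a minimal sketch of this pattern (not from the talk; `scale` is a hypothetical kernel standing in for real work), pinned host memory plus two CUDA streams let transfers overlap kernel execution:

#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 24, half = n / 2;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned: full PCI-e rate, async-capable
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) cudaStreamCreate(&s[i]);

    for (int i = 0; i < 2; i++) {
        // Fermi's two DMA engines allow the H2D copy of one half to
        // overlap the D2H copy of the other, with kernels in between.
        cudaMemcpyAsync(d + i * half, h + i * half, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(half + 255) / 256, 256, 0, s[i]>>>(d + i * half, half, 2.0f);
        cudaMemcpyAsync(h + i * half, d + i * half, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    cudaFreeHost(h); cudaFree(d);
    return 0;
}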

Page 6:

Accuracy

There may be several reasons for different results:
• Different algorithms: serial to parallel (reductions)
• Use of FMA instructions
• Math libraries

“A man with one watch knows what time it is; a man with two watches is never quite sure.” ~ Lee Segall

Page 7:

Computing π

Compute pi in single precision (seed 1234567)
Samples=     10000  Pi=3.16720009  Error= 0.2561E-01
Samples=    100000  Pi=3.13919997  Error= 0.2393E-02
Samples=   1000000  Pi=3.14109206  Error= 0.5007E-03
Samples=  10000000  Pi=3.14106607  Error= 0.5267E-03
  Mismatch between CPU/GPU counts: 78534862 vs. 78534859
Samples= 100000000  Pi=3.14139414  Error= 0.1986E-03

Compute pi in single precision (seed 1234)
Samples=     10000  Pi=3.11120009  Error= 0.3039E-01
Samples=    100000  Pi=3.13632011  Error= 0.5273E-02
Samples=   1000000  Pi=3.14056396  Error= 0.1029E-02
Samples=  10000000  Pi=3.14092445  Error= 0.6683E-03
Samples= 100000000  Pi=3.14158082  Error= 0.1192E-04

Where is the difference coming from?

  if ( (hostData(i)**2   + hostData(i+Nhalf)**2  ) <= 1._fp_kind ) inside_cpu = inside_cpu + 1   (CPU)
  if ( (deviceData(i)**2 + deviceData(i+Nhalf)**2) <= 1._fp_kind ) inside     = inside + 1       (GPU)

- The sum of the points inside the circle is done with integers (no issues due to floating-point arithmetic)
- The computation of the distance from the origin (x*x + y*y) uses no special functions, just + and *

Page 8:

Accuracy: effect of FMA instructions
• Fermi GPUs are IEEE 754 compliant, for both SP and DP
• Support for the fused multiply-add instruction (IEEE 754-2008)
• Results with FMA can differ* from results without FMA
• FMA can be toggled on/off with a compiler switch: extremely useful to compare results against a “golden” CPU output
• FMA will be present in future CPUs

Compute pi in single precision (seed 1234567, FMA disabled)
Samples=     10000  Pi=3.16720009  Error= 0.2561E-01
Samples=    100000  Pi=3.13919997  Error= 0.2393E-02
Samples=   1000000  Pi=3.14109206  Error= 0.5007E-03
Samples=  10000000  Pi=3.14106607  Error= 0.5267E-03
Samples= 100000000  Pi=3.14139462  Error= 0.1981E-03

*Single-precision GPU results with FMA are identical to double-precision CPU results
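One way to see the FMA effect in code (an assumed illustration, not from the slides): nvcc's -fmad flag toggles contraction globally, while the __fmul_rn/__fadd_rn intrinsics force individually rounded operations locally:

// The distance test from the pi example, with and without FMA.
// With FMA, x*x + y*y rounds once for the multiply-add; the intrinsics
// force each operation to round separately, matching a CPU without FMA.
// Globally, compiling with nvcc -fmad=false has the same effect.
__device__ int inside_fma(float x, float y)
{
    return x * x + y * y <= 1.0f;            // may compile to FMA on Fermi
}

__device__ int inside_no_fma(float x, float y)
{
    return __fadd_rn(__fmul_rn(x, x), __fmul_rn(y, y)) <= 1.0f;
}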

Page 9:

Accuracy: effect of math functions

Input: x = -9.841215180935854789368022*2.0

        x^6                                 pow(x,6)                            pow(x,6.0)
gcc     58139640.398606322705745697021484   58139640.398606315255165100097656   58139640.398606315255165100097656
icc     58139640.398606322705745697021484   58139640.398606322705745697021484   58139640.398606322705745697021484
pgcc    58139640.398606322705745697021484   58139640.398606322705745697021484   58139640.398606322705745697021484
CUDA    58139640.398606322705745697021484   58139640.398606322705745697021484   58139640.398606322705745697021484

Different math libraries can give different results: here gcc's pow differs from the other compilers in the trailing digits.

Page 10:

Library for DGEMM

Both CPU cores and GPUs are used in synergy, with minor or no modifications to the original source code:

- A host library intercepts calls to DGEMM and executes them simultaneously on the GPUs and the CPU cores (an illustrative sketch of the interception follows below)
- Pinned memory is used for fast PCI-e transfers (up to 6 GB/s on x16 Gen2, 11 GB/s on x16 Gen3), and computation is overlapped with communication
- The library has been used to accelerate Linpack and Paratec; a similar approach (the phiGemm library) is used in Quantum Espresso
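As an illustration of the interception idea (a hedged sketch, not the talk's actual library: it assumes 'N','N' transposes, a fixed hypothetical split, and omits error checking), a host-side interposer can capture dgemm_ via LD_PRELOAD and split the columns:

// Build (sketch): gcc -shared -fPIC dgemm_split.c -o libdgemm_split.so \
//                 -I/usr/local/cuda/include -lcublas -ldl
// Run:            LD_PRELOAD=./libdgemm_split.so ./app
#define _GNU_SOURCE
#include <dlfcn.h>
#include <cublas.h>

typedef void (*dgemm_t)(const char*, const char*, const int*, const int*, const int*,
                        const double*, const double*, const int*, const double*, const int*,
                        const double*, double*, const int*);

void dgemm_(const char *ta, const char *tb, const int *m, const int *n, const int *k,
            const double *alpha, const double *A, const int *lda,
            const double *B, const int *ldb,
            const double *beta, double *C, const int *ldc)
{
    static dgemm_t real_dgemm;
    if (!real_dgemm) real_dgemm = (dgemm_t)dlsym(RTLD_NEXT, "dgemm_");

    int n_gpu = (int)(*n * 0.80);              /* hypothetical fixed split, eta = 0.80 */
    int n_cpu = *n - n_gpu;

    double *dA, *dB, *dC;
    cublasAlloc((*m) * (*k),  sizeof(double), (void**)&dA);
    cublasAlloc((*k) * n_gpu, sizeof(double), (void**)&dB);
    cublasAlloc((*m) * n_gpu, sizeof(double), (void**)&dC);

    cublasSetMatrix(*m, *k,    sizeof(double), A, *lda, dA, *m);
    cublasSetMatrix(*k, n_gpu, sizeof(double), B, *ldb, dB, *k);
    cublasSetMatrix(*m, n_gpu, sizeof(double), C, *ldc, dC, *m);

    /* Asynchronous: control returns while the GPU works on the first n_gpu columns */
    cublasDgemm('n', 'n', *m, n_gpu, *k, *alpha, dA, *m, dB, *k, *beta, dC, *m);

    /* CPU cores take the remaining n_cpu columns in parallel */
    real_dgemm(ta, tb, m, &n_cpu, k, alpha, A, lda,
               B + (size_t)(*ldb) * n_gpu, ldb, beta,
               C + (size_t)(*ldc) * n_gpu, ldc);

    cublasGetMatrix(*m, n_gpu, sizeof(double), dC, *m, C, *ldc);
    cublasFree(dA); cublasFree(dB); cublasFree(dC);
}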

Page 11:

PCI-e transfer speed

CUDA offers faster PCI-e transfers when host memory is allocated with cudaMallocHost instead of regular malloc.

             SUN Ultra 24 (x16 gen2)      Supermicro 6016GT (x16 gen2)     Sandy Bridge (x16 gen3)
             Pageable      Pinned         Pageable      Pinned             Pageable      Pinned
H2D (MB/s)       2132        5212             4665        5745                 3168       11163
D2H (MB/s)       1882        5471             4064        6059                 2961       10624
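A minimal sketch of the pageable-vs-pinned measurement behind this table (assumed harness, not the talk's benchmark code):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

static float h2d_mbps(void *host, void *dev, size_t bytes)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);
    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return bytes / (ms * 1.0e3f);              // bytes / (ms*1e3) = MB/s (1 MB = 1e6 B)
}

int main(void)
{
    size_t bytes = 64UL << 20;                 // 64 MB payload
    void *dev, *pinned, *pageable = malloc(bytes);
    cudaMalloc(&dev, bytes);
    cudaMallocHost(&pinned, bytes);            // pinned (page-locked) allocation
    printf("pageable H2D: %.0f MB/s\n", h2d_mbps(pageable, dev, bytes));
    printf("pinned   H2D: %.0f MB/s\n", h2d_mbps(pinned, dev, bytes));
    cudaFree(dev); cudaFreeHost(pinned); free(pageable);
    return 0;
}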

Page 12:

DGEMM: C = alpha*A*B + beta*C

DGEMM(A, B, C) = DGEMM(A, B1, C1) ∪ DGEMM(A, B2, C2)

B and C are split by columns: B1 and C1 are processed on the GPU, B2 and C2 on the CPU cores.

The idea can be extended to multi-GPU configurations and to handle huge matrices. Find the optimal split knowing the relative DGEMM performance of the GPU and the CPU cores.

Page 13:

Optimal split

If A is M x K, B is K x N and C is M x N, a DGEMM call performs 2*M*K*N operations.

Split N = N1 + N2 and choose the split so that the GPU and CPU parts finish at the same time:

  T_GPU(M, K, N1) = T_CPU(M, K, N2)

If G_CPU denotes the DGEMM performance of the CPU cores in Gflops and G_GPU that of the GPU, the optimal split is

  η = N1/N = G_GPU / (G_CPU + G_GPU)
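As a concrete check, using the Fermi numbers reported later in the talk (G_GPU ≈ 350 Gflops for a Tesla M2050, G_CPU ≈ 85 Gflops for eight Xeon cores): η = 350 / (85 + 350) ≈ 0.80, so roughly 80% of the columns of B and C go to the GPU.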

Page 14:

Overlap DGEMM on CPU and GPU

// Copy A to the GPU
cublasSetMatrix(m, k, sizeof(A[0]), A, lda, devA, m);
// Copy B1 (first n_gpu columns of B)
cublasSetMatrix(k, n_gpu, sizeof(B[0]), B, ldb, devB, k);
// Copy C1 (first n_gpu columns of C)
cublasSetMatrix(m, n_gpu, sizeof(C[0]), C, ldc, devC, m);
// DGEMM on the GPU; the call is asynchronous, so control returns immediately to the CPU
cublasDgemm('n', 'n', m, n_gpu, k, alpha, devA, m, devB, k, beta, devC, m);
// DGEMM on the CPU cores on the remaining n_cpu columns, overlapping the GPU call
dgemm('n', 'n', m, n_cpu, k, alpha, A, lda, B + ldb*n_gpu, ldb, beta, C + ldc*n_gpu, ldc);
// Copy C1 back (implicitly waits for the GPU DGEMM to finish)
status = cublasGetMatrix(m, n_gpu, sizeof(C[0]), devC, m, C, ldc);

Using CUDA, it is very easy to express the workflow in the diagram.

Page 15:

DGEMM compute/copy analysis

• Assume M, N >> K
• Copy time = 8*(m*k + n*k + 2*m*n) / PCIe ≈ 16*m*n / PCIe   (8 bytes per double; C crosses the bus in both directions)
• Compute time = 2*m*n*k / GFLOPS
• Compute fraction = 1 / (1 + x),  with x = 8*GFLOPS / (K*PCIe)
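Plugging in the Fermi numbers used elsewhere in the talk (GFLOPS ≈ 350, PCIe ≈ 6 GB/s): with K = 1024, x = 8*350 / (1024*6) ≈ 0.46, so the copies take about half as long as the compute and can be fully hidden behind it; with K = 128 instead, x ≈ 3.6 and the transfers dominate no matter how well they are overlapped.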

Page 16:

Fermi DGEMM Strategy

• Slice the matrix into several pieces
• Use the stream API to overlap copy and compute (a sketch follows below):

  Copy A H2D
  Loop over pieces i:
      Copy Bi, Ci H2D
      DGEMM(A, Bi, Ci)
      Copy Ci D2H
  CPU: DGEMM(A, B_last, C_last)
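A hedged sketch of this slicing loop using the legacy cuBLAS API that the earlier slide uses. The preallocated double buffers, the two-stream scheme, and the assumption that the slice width divides n are mine, not the talk's exact code; host B and C must be in pinned memory for the async copies to overlap:

#include <cuda_runtime.h>
#include <cublas.h>

// devB and devC hold two slices each (k*2w and m*2w doubles); devA holds A.
void gpu_dgemm_sliced(int m, int n, int k, double alpha, double beta,
                      const double *A, int lda, const double *B, int ldb,
                      double *C, int ldc,
                      double *devA, double *devB, double *devC,
                      int nslice, cudaStream_t s[2])
{
    cublasSetMatrix(m, k, sizeof(double), A, lda, devA, m);   // A copied once
    int w = n / nslice;                                       // slice width
    for (int i = 0; i < nslice; i++) {
        cudaStream_t st = s[i % 2];        // double buffering: copy of slice i+1
                                           // overlaps the DGEMM of slice i
        const double *Bi = B + (size_t)ldb * i * w;
        double *Ci  = C    + (size_t)ldc * i * w;
        double *dBi = devB + (size_t)k * (i % 2) * w;
        double *dCi = devC + (size_t)m * (i % 2) * w;

        cublasSetMatrixAsync(k, w, sizeof(double), Bi, ldb, dBi, k, st);
        cublasSetMatrixAsync(m, w, sizeof(double), Ci, ldc, dCi, m, st);
        cublasSetKernelStream(st);         // legacy API: bind the DGEMM to this stream
        cublasDgemm('n', 'n', m, w, k, alpha, devA, m, dBi, k, beta, dCi, m);
        cublasGetMatrixAsync(m, w, sizeof(double), dCi, m, Ci, ldc, st);
    }
    cudaDeviceSynchronize();               // the CPU DGEMM on its share runs meanwhile
}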

Page 17:

DGEMM Compute/Copy Overlap

Page 18:

Additional overlap strategy

• The copy of the A matrix can be significant for smaller matrix sizes
• Split A for additional overlap

Page 19:

DGEMM Performance

[Figure: DGEMM performance, GFLOPs (0 to 120) vs. matrix size (128 to 6080), for a quad-core Xeon 2.8 GHz with MKL 10.3, a Tesla C1060 GPU (1.296 GHz), and CPU + GPU combined]

Page 20:

Fermi DGEMM Performance

[Figure: DGEMM performance, GFLOPS (0 to 500) vs. size N = M (K = 1024, N up to 18000), on a dual quad-core Xeon X5550 2.66 GHz (8 cores, MKL 10.2.4.032) with a Tesla M2050 "Fermi" (1.15 GHz): about 85 GFLOPS for the CPU, 350 for the GPU, and 435 for CPU + GPU]

Page 21:

Optimizations: auto split

• Keep track of CPU and GPU performance and adjust the split:
  - wallclock() for CPU time
  - CUDA event records for GPU time
  - Compute the optimal split for the next iteration

  cudaEventRecord(GPU_start, 0);
  loop: launch GPU copies + kernels
  cudaEventRecord(GPU_stop, 0);

  CPU_start = wallclock();
  call CPU_DGEMM
  CPU_stop = wallclock();

  cudaEventSynchronize(GPU_stop);

  GPU_GFLOPS = GPU_FLOPS / GPU_TIME
  CPU_GFLOPS = CPU_FLOPS / CPU_TIME
  SPLIT = GPU_GFLOPS / (GPU_GFLOPS + CPU_GFLOPS)
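A concrete version of this measurement loop (a sketch: wallclock() is assumed to be gettimeofday-based, and the actual GPU and CPU DGEMM calls are stood in by comments):

#include <sys/time.h>
#include <cuda_runtime.h>

static double wallclock(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

double adjust_split(double m, double n, double k, double n_gpu)
{
    cudaEvent_t gpu_start, gpu_stop;
    cudaEventCreate(&gpu_start); cudaEventCreate(&gpu_stop);

    cudaEventRecord(gpu_start, 0);
    /* ... launch GPU copies + DGEMM kernels (asynchronous) ... */
    cudaEventRecord(gpu_stop, 0);

    double cpu_t0 = wallclock();
    /* ... CPU DGEMM on the remaining n - n_gpu columns ... */
    double cpu_time = wallclock() - cpu_t0;

    cudaEventSynchronize(gpu_stop);
    float gpu_ms;
    cudaEventElapsedTime(&gpu_ms, gpu_start, gpu_stop);
    cudaEventDestroy(gpu_start); cudaEventDestroy(gpu_stop);

    double gpu_gflops = 2.0 * m * n_gpu * k / (gpu_ms * 1.0e6);        // ms -> Gflops
    double cpu_gflops = 2.0 * m * (n - n_gpu) * k / (cpu_time * 1.0e9);
    return gpu_gflops / (gpu_gflops + cpu_gflops);   // split for the next iteration
}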

Page 22:

Results on a single node
Dual Intel Xeon X5560 2.8 GHz, 96 GB memory, 2 Tesla M2050

- Peak DP: 89 + 515 * (2) = 604 (1119) GFLOPS
- DGEMM (2/3 of peak on Fermi): 89 + 350 * (2) = 439 (789) GFLOPS

One GPU (P x Q = 1 x 1):
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L2      108032   768     1     1            2011.42              4.179e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0039415 ...... PASSED
================================================================================
417.9 GFLOPS = 69% of "peak" or 95% of "DGEMM"

Two GPUs (P x Q = 1 x 2):
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L2      108032   768     1     2            1192.13              7.051e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0040532 ...... PASSED
================================================================================
705.1 GFLOPS = 63% of "peak" or 89% of "DGEMM"

Page 23:

Results on clusters

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR13C2L4     2359296   768    32   145            6886.10              1.271e+06
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0033806 ...... PASSED
================================================================================

53 systems with Tesla GPUs on the latest Top500 (June 2012)

Several machines with thousands of Tesla GPUs

Page 24:

Results: TeraTF

• TeraTF is a 3D Euler hydrodynamics solver
• 2nd order Godunov-type scheme, 3rd order remapping
• Various Riemann solvers:
  - Exact (used in the SPEC benchmark)
  - Dukowicz (used in a benchmark from CEA)
  - Acoustic
• Fortran 90 with MPI and OpenMP parallelization
• Porting to GPU done with CUDA Fortran
• Part of the SPEC MPI 2007 benchmark

Page 25:

CUDA Fortran

• PGI / NVIDIA collaboration
• Same CUDA programming model as CUDA C
• Program the GPU in Fortran syntax
• Strongly typed: variables declared with the device attribute reside in GPU memory
• Use standard allocate / deallocate
• Copy between CPU and GPU with an assignment statement (GPU_array = CPU_array)
• Copy subsets of arrays with interval notation

Page 26:

Data Layout

• Hydro_vars(vars, i, j, k) = Array of Structs:
  - CPU sequential access = efficient
  - GPU parallel "warp" access = inefficient (uncoalesced)
• Hydro_vars(i, j, k, vars) = Struct of Arrays:
  - GPU parallel "warp" access = efficient (coalesced)
  (see the CUDA C sketch below)
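A sketch of the two layouts in CUDA C terms (struct and kernel names are hypothetical; TeraTF itself is Fortran):

// AoS: hydro[cell].var[v] -> consecutive threads stride by NVARS doubles,
//      so a warp's loads scatter across many memory transactions.
// SoA: var0[cell]         -> consecutive threads read consecutive words,
//      so a warp's loads coalesce into few transactions.
#define NVARS 8

struct CellAoS { double var[NVARS]; };     // Hydro_vars(vars, i, j, k)

__global__ void update_aos(struct CellAoS *h, int ncells)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < ncells)
        h[c].var[0] += 1.0;                // uncoalesced: NVARS*8-byte stride
}

__global__ void update_soa(double *var0, int ncells)   // Hydro_vars(i, j, k, vars)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < ncells)
        var0[c] += 1.0;                    // coalesced: contiguous segment per warp
}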

Page 27:

Data Layout

• Pad leading dimensions:
  - Multiple of the memory transaction granularity
  - Coalescing / alignment
• Equalize leading dimensions:
  - Reduce register pressure / simplify addressing

[Figure: padded Hydro_vars array (i x k planes with plane_stride), comm buffers, cell work and node work regions]

Page 28:

TeraTF results

• 240^3 grid, SOD problem, Dukowicz and Exact Riemann solvers
• 8 MPI processes (2x2x2), 1 GPU per MPI process
• 8 M2050 GPUs, dual quad-core Xeon X5560 (2.66 GHz)
• 4 GPUs share an x16 PCIe link (not optimal)

Page 29:

Accuracy of GPU and CPU results

• Dukowicz: GPU and CPU results are identical
• Exact: difference in the last digit (pow?)

                      CPU                  GPU
START
  Mass                0.1082530240533624   0.1082530240533624
  Total energy        0.2646185032453942   0.2646185032453942
END (Dukowicz)
  Mass                0.1082530240536995   0.1082530240536995
  Total energy        0.2646185032454266   0.2646185032454266
END (Exact)
  Mass                0.1082530240536390   0.1082530240536390
  Total energy        0.2646185032458129   0.2646185032458130

Page 30:

TeraTF Performance

[Figure: TeraTF GPU speed-ups of 15x and 19x over the CPU]

Page 31:

TeraTF Performance

Medium case of the SPEC MPI configuration (240^3, Exact Riemann solver) on a Cray XK6:

MPI tasks   GPUs   Time (s)   Speed-up
1x1x1          1        691        1.0
2x2x2          8        112        6.1
3x3x3         27         42       16.4
4x4x4         64         24       28.7
5x5x5        125         16       42.3

Page 32:

Hybrid approach

• Partition the local domain across GPU and CPU (OpenMP); a sketch follows below

[Figure: local domain split between GPU and CPU, with different partitions for the X and Y phase and for the Z phase]
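A hedged sketch of this partition in CUDA C (names are hypothetical stand-ins for TeraTF's solvers; compile with nvcc -Xcompiler -fopenmp): the GPU takes the first slab of the local domain and works asynchronously while OpenMP threads update the rest:

#include <cuda_runtime.h>

__global__ void update_gpu(double *u, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] += 1.0;                    // stand-in for the hydro update
}

void hybrid_step(double *dev_u, double *host_u, int n_local, double frac_gpu)
{
    int n_gpu = (int)(n_local * frac_gpu);     // GPU slab

    // Kernel launches are asynchronous: the CPU continues immediately
    update_gpu<<<(n_gpu + 255) / 256, 256>>>(dev_u, n_gpu);

    // OpenMP threads update the CPU slab while the GPU works
    #pragma omp parallel for
    for (int i = n_gpu; i < n_local; i++)
        host_u[i] += 1.0;

    cudaDeviceSynchronize();                   // join before the halo exchange
}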

Page 33:

TeraTF Hybrid Performance

Configuration   Speed-up
1 CPU core          1.0
8 OMP               3.96
8 MPI               6.4
GPU                21
GPU + 8 OMP        27

Page 34:

Conclusions

• It is possible to fully utilize the computational resources available on hybrid clusters
• With the right software design, PCI-e transfer time can be hidden
• Improved power efficiency